Adaptive Speech Streaming Based on Speech Quality Estimation and Artificial Bandwidth Extension for Voice over Wireless Multimedia Sensor Networks

Abstract

In this paper, an adaptive speech streaming method is proposed to improve the perceived speech quality (PSQ) of voice over wireless multimedia sensor networks (WMSNs). First, the proposed method estimates the PSQ of the received speech data under different network conditions, which are represented by packet loss rates (PLRs). Simultaneously, the proposed method classifies each frame of the speech signal as either an onset or a nononset frame. Based on the estimated PSQ and the speech class, it determines an appropriate bit rate for the redundant speech data (RSD) that are transmitted with the primary speech data (PSD) to help reconstruct the speech signals of any lost frames. In particular, when the estimated PLR is high, the bit rate of the RSD is increased by decreasing that of the PSD. As a result, the bandwidth of the PSD changes from wideband to narrowband, and an artificial bandwidth extension technique is applied to the decoded narrowband speech. Simulation results show that the proposed method significantly improves the decoded speech quality under packet loss conditions in a WMSN, compared to a decoder-based packet loss concealment method and a conventional redundant speech transmission method.

1. Introduction

Because of the rapid development of low-power and highly integrated digital electronic technologies, wireless multimedia sensor networks (WMSNs) are capable of retrieving audio and/or video streams by interconnecting sensor nodes equipped with multimedia devices such as cameras and microphones. Accordingly, they enable a wide range of applications that require real-time access to audio or video data, such as environmental monitoring, human tracking, and security systems [1, 2]. However, it is difficult to guarantee seamless audio or video quality because multimedia data are usually generated at a much higher bit rate than other sensor data. Furthermore, the reliability of transmission over WMSNs is more likely to be degraded than in other networks because of the various resource constraints in WMSNs [3, 4]. Thus, the quality of service (QoS) of multimedia streaming over WMSNs has become even more important. Specifically, applications of voice over WMSN (VoWMSN) require a minimum level of perceived speech quality (PSQ) [5–7].

To improve the PSQ of voice applications, various speech streaming methods have been proposed for use on IP networks. These methods are typically classified into either sender-based schemes or receiver-based schemes. Sender-based schemes consist of a collection of packet loss protection methods that provide error-robust transmission methods such as interleaving, forward error correction (FEC), and redundant speech transmission (RST) [8–12]. On the other hand, receiver-based schemes consist of a collection of packet loss concealment (PLC) methods that compensate for lost speech signals using substitutable signals, for example, silence, previous good speech, or regenerated speech according to the analysis-by-synthesis criterion [13–16]. However, these two schemes can complement each other. That is, sender-based schemes are robust to higher packet loss rates (PLRs) because they often use redundant information to recover lost signals, which results in increased transmission bandwidth. Receiver-based schemes do not need to increase the transmission bandwidth because they conceal lost signals without using redundant information from the sender side; however, it is hard to prevent rapid PSQ degradation under high PLRs. It was also reported in [5] that in VoWMSN applications the receiver-based scheme could accommodate higher bit rates of speech coding under low PLR conditions, whereas the sender-based scheme was suitable for dealing with lower bit rates under high PLR conditions.

To take advantage of both schemes, a method was proposed in [17] that transmitted redundant speech data (RSD) adaptively according to the PSQ estimated under the current PLR condition and determined a suitable RST mode. On the basis of the mode, it generated bitstreams of primary speech data (PSD) and RSD using a scalable speech coder so as to maintain an equivalent transmission bandwidth. Under a high PLR, a lost speech signal was then recovered using the RSD. As a result, this method provided an improved overall PSQ under various PLRs within an equivalent transmission bandwidth. Despite these advantages, the method suffered from degraded PSQ when the speech signal bandwidth changed from wideband to narrowband, which happened when the bit rate of the PSD was decreased in order to assign more bit rate to the RSD. To overcome this problem, the method proposed in this paper applies an artificial bandwidth extension (ABE) technique [18] to the decoded narrowband speech so that the quality of the decoded speech is not degraded by the reduced bandwidth. In addition, a PSQ estimation method is proposed, which is used together with speech classification to determine an appropriate RST mode.

The remainder of this paper is organized as follows. Section 2 describes the overall procedure and packet flow of the VoWMSN, which employs the proposed adaptive speech streaming method. Then, Section 3 proposes an adaptive speech streaming method for VoWMSNs. Section 4 evaluates the performance of the proposed method and compares it with those of a decoder-based PLC method and a conventional RST method. Finally, Section 5 concludes this paper.

2. Voice over WMSN Based on Adaptive Speech Streaming

2.1. Overall Structure

Figure 1 shows a block diagram and packet flow for a VoWMSN system that employs the proposed adaptive speech streaming method. As each frame of the speech signal, $P S D (n)$ , comes into an input device at the sender side, it is classified as either an onset frame or a nononset frame. The classification result, together with the estimated PSQ that is delivered from the receiver side, is then used to determine a suitable RST mode. Next, the bitstreams of the PSD and RSD, $\bar{P S D} (n)$ and $\bar{R S D} (n)$ , are generated using a scalable speech encoder according to the determined RST mode. After that, $\bar{P S D} (n)$ and $\bar{R S D} (n)$ are combined according to the real-time transport protocol (RTP) payload format to obtain an RTP packet, $\bar{P K T} (n)$ , which is transmitted to the receiver side over the WMSN.

Figure 1

Block diagram and packet flow of a VoWMSN system employing the proposed adaptive speech streaming method.

Meanwhile, at the receiver side, $\bar{P S D} (n)$ is depacketized from the RTP payload. In certain RST modes, if any $\bar{R S D} (n)$ exists in the payload, it is stored for use with any potential upcoming packet losses. The extracted $\bar{P S D} (n)$ is then decoded into $\bar{\bar{P S D}} (n)$ using a scalable speech decoder. Finally, $\bar{\bar{P S D}} (n)$ is sent to the output device. Note that if $\bar{P S D} (n)$ is only composed of a narrowband speech bitstream, the decoded narrowband speech signal, ${\bar{\bar{P S D}}}_{N B} (n)$ , is artificially extended to a wideband speech signal to maintain a seamless PSQ. Since the PSQ estimation method (which will be described in Section 3.1) accepts only narrowband speech, ${\bar{\bar{P S D}}}_{N B} (n)$ is also used for the PSQ estimation.

2.2. RTP Payload Format

As mentioned above, the proposed adaptive speech streaming method can use an indicator for scalable bit rate speech coding. In addition, the payload needs reserved fields that can carry both the RSD bitstream and the estimated PSQ, the latter being delivered from the receiver side to the sender side. To this end, the RTP payload format defined in IETF RFC 4749 [19, 20], which is shown in Figure 2(a), is modified as shown in Figure 2(b).

Figure 2

Comparison of RTP payload formats: (a) the format defined in IETF RFC 4749 and (b) a modified format for the proposed method.

As shown in Figure 2(a), the payload header consists of the “ $M B S | F T$ ” sequence. The four-bit maximum bit rate supported (MBS) field is used to inform the sender side of the maximum bit rate that can be received. In the ITU-T Recommendation G.729.1 speech coder [21], which is used to implement the VoWMSN system described in Section 4, the MBS is assigned a value from 0 to 11, corresponding to an encoding bit rate of 8, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, or 32 kbit/s, respectively. In addition, the frame-type (FT) field, which also consists of four bits, indicates the actual encoding bit rate of the contained bitstreams. Thus, this field is also assigned a value from 0 to 11, corresponding to one of the encoding bit rates between 8 and 32 kbit/s. It should be noted that the value 15 indicates that there is no data to be transmitted, while values 12 to 14 are reserved for future use.
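The header layout described above can be illustrated with a short sketch. It covers only the unmodified format of Figure 2(a) and assumes that MBS occupies the high-order four bits of a single header octet and FT the low-order four bits, following the "MBS | FT" ordering in the text. The function names `parse_payload_header` and `describe_ft` are hypothetical and are not taken from any existing library.

```python
# Bit rates (kbit/s) indexed by the 4-bit MBS/FT value, as listed in the text.
BIT_RATES_KBPS = [8, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]
FT_NO_DATA = 15  # FT value meaning "no data to be transmitted"


def parse_payload_header(header_byte):
    """Split one header octet into (MBS, FT).

    Assumption: MBS is the high nibble and FT is the low nibble.
    """
    if not 0 <= header_byte <= 0xFF:
        raise ValueError("header must fit in one octet")
    mbs = (header_byte >> 4) & 0x0F
    ft = header_byte & 0x0F
    return mbs, ft


def describe_ft(ft):
    """Return a readable description of an FT value (unmodified format)."""
    if not 0 <= ft <= 15:
        raise ValueError("FT is a 4-bit field")
    if ft < len(BIT_RATES_KBPS):
        return f"{BIT_RATES_KBPS[ft]} kbit/s"
    if ft == FT_NO_DATA:
        return "no data"
    return "reserved"  # values 12 to 14


mbs, ft = parse_payload_header(0xB3)
print(mbs, describe_ft(ft))
```

In this example the octet 0xB3 carries MBS = 11 (a receiver that accepts up to 32 kbit/s) and FT = 3 (a frame encoded at 16 kbit/s).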

On the other hand, in the modified RTP payload format shown in Figure 2(b), two frame types (FT = 12 and FT = 13) are added to indicate the RSD bitstream and the estimated PSQ, respectively. Moreover, the main field for speech frames in Figure 2(a) is split into three fields that carry the PSD bitstream, the RSD bitstream, and the estimated PSQ, respectively.

3. Proposed Adaptive Speech Streaming Method

3.1. Speech Quality Estimation

The proposed adaptive speech streaming method begins by estimating the PSQ because the PSQ is a good indicator of both the current PLR and the bit rate of speech coding. To this end, ITU-T Recommendation P.563 [22] is employed in this paper as an objective PSQ assessment method to monitor the PSQ of VoWMSN; it estimates the PSQ as a mean opinion score (MOS) without using a reference speech signal. Because the proposed method requires the PSQ to be estimated in real time, ITU-T Recommendation P.563 is modified to operate with low delay. The modified version is referred to in this paper as the low-delay nonintrusive perceived speech quality assessment (LD-QA) method.

Figure 3 shows the overall structure of the PSQ estimation method, which consists of three processing stages: the preprocessing stage, the distortion estimation stage, and the perceptual mapping stage. To take into account various distortion factors during speech streaming, the distortion estimation stage, which is based on ITU-T Recommendation P.563, combines three processing modules. The first module models the vocal tract as a series of tubes, treats abnormal variations of the tubes as degradation, and checks whether the estimated linear prediction coefficients (LPCs) lie within the range expected for a natural speech signal. The second module reconstructs a clean reference speech signal from the degraded speech signal and then evaluates the difference between the reconstructed clean speech and the degraded speech signal. The third module identifies and estimates specific distortions expected to be encountered in transmission channels.

Figure 3

Overall structure of the PSQ estimation method using three processing stages, where the distortion estimation stage is based on the ITU-T Recommendation P.563.

In the perceptual mapping stage, the distortion effects estimated in the distortion estimation stage are linearly combined to estimate an MOS, which is denoted by $\tilde{Q} (n)$ in Figure 3. The PSQ estimation method described so far provides a single MOS for an input speech file whose length should be longer than approximately 4 s [22].

On the other hand, the proposed LD-QA method modifies the second stage so that the distortion effects are modeled using a minimal amount of speech data. In particular, each pitch mark in the first module of the second stage is extracted once every frame, where the frame size is 64 ms long, and the second module is also applied once every frame. In addition, the distortion-specific parameters for noise detection, temporal time clipping, and robotization are updated once every frame using speech segments that are 500 ms, 64 ms, and 1 s long, respectively. Consequently, $\tilde{Q} (n)$ is calculated once every frame, which results in a processing delay of 20 ms for the LD-QA method.

3.2. Artificial Bandwidth Extension

When the estimated PLR is high, the bit rate of the RSD should be increased by decreasing the bit rate of the PSD, which is realized by changing the bandwidth of the PSD from wideband to narrowband. In this case, an ABE technique extends the bandwidth of the decoded speech from narrowband back to wideband so that the PSQ remains seamless despite the bandwidth reduction.

Figure 4 shows a block diagram of the ABE technique operated in the modified discrete cosine transform (MDCT) domain [18]. In this figure, if the RST mode is 2, the narrowband speech, ${\bar{\bar{P S D}}}_{N B} (n)$ , is segmented into a sequence of frames with a frame size of $N$ . Each analysis frame is then transformed into the frequency domain using a $2 N$ -point MDCT to give ${\bar{\bar{P S D}}}_{N B} (k)$ . Next, the ABE method is applied to obtain the high-band MDCT coefficients, ${\bar{\bar{P S D}}}_{HB\_only} (k)$ . Specifically, the ABE method first extends the 4–4.6 kHz band, ${\bar{\bar{P S D}}}_{4 - 4.6} (k)$ , using the harmonic spectral band replication and correlation-based replication techniques [18]. After that, it extends the 4.6–7 kHz band, ${\bar{\bar{P S D}}}_{4.6 - 7} (k)$ , using a spectral folding technique. Subsequently, ${\bar{\bar{P S D}}}_{HB\_only} (k)$ is generated from the extended signals ${\bar{\bar{P S D}}}_{4 - 4.6} (k)$ and ${\bar{\bar{P S D}}}_{4.6 - 7} (k)$ . The low-band and high-band signals in the time domain, ${\bar{\bar{P S D}}}_{N B} (n)$ and ${\bar{\bar{P S D}}}_{HB\_only} (n)$ , are then obtained by applying a $2 N$ -point inverse MDCT (IMDCT). Finally, the bandwidth-extended signal in the time domain, $\bar{\bar{P S D}} (n)$ , is obtained from the quadrature mirror filterbanks (QMFs).

Figure 4

Block diagram of an artificial bandwidth extension technique applied to decoded narrowband speech.

3.3. RST Mode Decision and Bit Rate Assignment

In the proposed adaptive speech streaming method, the PSQ is estimated at the receiver side using the LD-QA method and then it is sent back to the sender side to make the RST mode decision. Figure 5 shows a block diagram of the RST mode decision based on the speech class and the estimated PSQ in the proposed method.

Figure 5

Block diagram of the RST mode decision based on the speech class and the estimated PSQ in the proposed adaptive speech streaming method.

First, each frame is classified into one of six different classes, namely, silence/background noise, stationary unvoiced, nonstationary unvoiced, speech onset, nonstationary voiced, or stationary voiced [23]. Next, a preliminary experiment is carried out to investigate the relationship between each class and the RST mode under different PLR conditions. Consequently, it is found that the RST mode is most sensitive to the speech onset class. Thus, the proposed adaptive speech streaming method decides only whether or not the nth frame is primarily made up of speech onset, such that

$$c (n) = \begin{cases} 1, & \text{if } P S D (n) \text{ is onset}, \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$

Next, the RST mode, $M (n)$ , is determined using both $\tilde{Q} (n)$ described in Section 3.1 and $c (n)$ in (1) as

$$M (n) = \begin{cases} 0, & \text{if } \tilde{Q} (n) \geq \theta_{Q_{2}}, \\ 1, & \text{if } \theta_{Q_{1}} \leq \tilde{Q} (n) < \theta_{Q_{2}} \text{ and } c (n) = 0, \\ 2, & \text{otherwise}, \end{cases} \tag{2}$$

where $\theta_{Q_{1}}$ and $\theta_{Q_{2}}$ are the two predefined thresholds for estimating the speech quality degradation for the current PLR.
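The decision rule in (2) can be written as a small function. This is a minimal sketch: the function name `rst_mode` is hypothetical, and the default thresholds are the values reported for the experiments in Section 4 (3.75 and 4.15 MOS).

```python
THETA_Q1 = 3.75  # lower threshold in MOS, value used in Section 4
THETA_Q2 = 4.15  # upper threshold in MOS, value used in Section 4


def rst_mode(q_est, is_onset, theta_q1=THETA_Q1, theta_q2=THETA_Q2):
    """Return the RST mode M(n) for an estimated MOS and an onset flag c(n)."""
    if q_est >= theta_q2:
        return 0  # quality is high: no redundancy
    if theta_q1 <= q_est < theta_q2 and not is_onset:
        return 1  # moderate degradation on a nononset frame
    return 2      # low quality, or an onset frame below the upper threshold


for q, onset in [(4.3, True), (4.0, False), (4.0, True), (3.2, False)]:
    print(q, onset, rst_mode(q, onset))
```

Note that an onset frame never selects mode 1: once the estimated MOS falls below the upper threshold, an onset frame goes directly to mode 2.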

Figure 6 shows the bit rate assignments for the PSD and RSD bitstreams according to the RST mode, $M (n)$ . As shown in the figure, if $M (n) = 0$ , $\bar{P S D} (n)$ is composed of the nth speech bitstream encoded at the highest bit rate, $R_{P 0}$ , with no RSD bitstream, which is denoted by $\bar{P S D} (n) = [P S D (n) @ R_{P 0}]$ . Otherwise, $\bar{P S D} (n)$ is encoded at a lower bit rate, $R_{P 1}$ or $R_{P 2}$ , and the remaining bit rate, $R_{R 0}$ or $R_{R 1} + R_{R 2}$ , is assigned to the RSD bitstream. Specifically, if $M (n) = 1$ , $\bar{R S D} (n)$ is composed of the $(n + 1)$ th speech bitstream encoded at $R_{R 0}$ ; that is, $\bar{R S D} (n) = [P S D (n + 1) @ R_{R 0}]$ . In addition, if $M (n) = 2$ , $\bar{R S D} (n)$ is composed of the $(n + 1)$ th and $(n + 2)$ th speech bitstreams, which are encoded at $R_{R 1}$ and $R_{R 2}$ , respectively; that is, $\bar{R S D} (n) = [P S D (n + 1) @ R_{R 1}, P S D (n + 2) @ R_{R 2}]$ . Consequently, the total bit rate for the speech data is the same in every RST mode.
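The assignment can be sketched as a lookup from the RST mode to the PSD and RSD components of one packet. The bit rates below are the values listed in Table 1 of Section 4; the function names and the (frame offset, bit rate) representation are assumptions made for illustration only.

```python
def assign_bit_rates(mode):
    """Return (psd, rsd) for one packet.

    Each component is a list of (frame_offset, kbit/s) pairs, where an
    offset of 1 means the bitstream belongs to frame n + 1.
    """
    if mode == 0:
        return [(0, 32)], []                 # PSD(n) @ R_P0, no RSD
    if mode == 1:
        return [(0, 16)], [(1, 16)]          # PSD(n) @ R_P1, PSD(n+1) @ R_R0
    if mode == 2:
        return [(0, 16)], [(1, 8), (2, 8)]   # PSD(n) @ R_P2, PSD(n+1) @ R_R1, PSD(n+2) @ R_R2
    raise ValueError("RST mode must be 0, 1, or 2")


def total_rate(mode):
    """Sum of the PSD and RSD bit rates in kbit/s."""
    psd, rsd = assign_bit_rates(mode)
    return sum(rate for _, rate in psd + rsd)


for m in (0, 1, 2):
    print(m, assign_bit_rates(m), total_rate(m))
```

Every mode sums to 32 kbit/s, which is the sense in which the total bit rate for the speech data is maintained.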

Figure 6

Bit rate assignment according to different RST modes.

4. Performance Evaluation

To demonstrate the effectiveness of the proposed adaptive speech streaming method, a VoWMSN system was created using ITU-T Recommendation G.729.1 as a scalable speech coder, as shown in Figure 7. The proposed method and the other speech streaming methods were all implemented in the application layer. For the evaluation, input speech signals were sampled at 16 kHz and encoded using the ITU-T Recommendation G.729.1 speech encoder at a bit rate of 32 kbit/s. The bit rate assignment for the PSD and RSD bitstreams according to different RST modes was performed as shown in Table 1. In addition, $\theta_{Q_{1}}$ and $\theta_{Q_{2}}$ in (2) were set to 3.75 and 4.15 MOS, respectively.

Table 1

Bit rate assignment (in kbit/s) according to different RST modes, where $R_{P 1} = R_{P 0} - R_{R 0}$ and $R_{P 2} = R_{P 0} - R_{R 1} - R_{R 2}$ . A hyphen indicates that the bit rate is not used in that mode.

RST mode $M (n)$	PSD: $R_{P 0}$	PSD: $R_{P 1}$	PSD: $R_{P 2}$	RSD: $R_{R 0}$	RSD: $R_{R 1}$	RSD: $R_{R 2}$
0	32	-	-	-	-	-
1	-	16	-	16	-	-
2	-	-	16	-	8	8

Figure 7

Structure of a VoWMSN system employing different speech streaming methods implemented in the application layer.

To compare performance at an equivalent transmission bandwidth, two conventional speech streaming methods were implemented: a decoder-based PLC method [21] and an RST method [10]. The decoder-based PLC method encoded speech signals using the ITU-T Recommendation G.729.1 encoder at 32 kbit/s with no RSD bitstream, whereas the conventional RST method encoded speech signals using the same encoder at a fixed bit rate of 16 kbit/s with an RSD bitstream of 16 kbit/s. In this experiment, 48 speech utterances were taken from the NTT-AT speech database [24], where each utterance was approximately 4 s long and was sampled at a rate of 16 kHz. Each utterance was filtered using a modified intermediate reference system (IRS) filter, followed by automatic level adjustment [25]. To evaluate the quality of the decoded speech for each method, wideband perceptual evaluation of speech quality (WPESQ) scores were measured as defined by ITU-T Recommendation P.862.2 [26]. To simulate the packet loss conditions of a WMSN, the Gilbert-Elliott (GE) channel model [25] was used because it is able to characterize the fading of a wireless network [6, 27, 28]. In this paper, burst PLRs from 0 to 25% were generated in steps of 5% using the GE model. Note that the mean and maximum burst lengths were measured at approximately 1.5 and 4 packets, respectively.
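For readers who wish to reproduce comparable loss patterns, the following is a minimal sketch of a two-state burst loss generator. It uses the simplified Gilbert form in which every packet is received in the good state and lost in the bad state, which is an assumption; the per-state loss probabilities of the full GE model and the parameters actually used in this paper are not given in the text. The transition probabilities are derived from a target PLR and a mean burst length of 1.5 packets; the maximum burst length of 4 packets is not enforced. The function names are hypothetical.

```python
import random


def gilbert_loss_pattern(n_packets, plr, mean_burst=1.5, seed=0):
    """Return a list of booleans, True where a packet is lost."""
    if not 0.0 <= plr < 1.0:
        raise ValueError("plr must be in [0, 1)")
    if mean_burst < 1.0:
        raise ValueError("mean_burst must be at least 1 packet")
    q = 1.0 / mean_burst          # P(bad -> good); mean burst length is 1/q
    p = plr * q / (1.0 - plr)     # P(good -> bad); stationary loss rate is p/(p+q)
    if p > 1.0:
        raise ValueError("plr is too high for this mean burst length")
    rng = random.Random(seed)
    bad = False
    lost = []
    for _ in range(n_packets):
        if bad:
            bad = rng.random() >= q   # remain in the bad state with probability 1 - q
        else:
            bad = rng.random() < p    # enter the bad state with probability p
        lost.append(bad)
    return lost


def mean_burst_length(lost):
    """Average length of the runs of consecutive lost packets."""
    bursts, run = [], 0
    for flag in lost:
        if flag:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    return sum(bursts) / len(bursts) if bursts else 0.0


pattern = gilbert_loss_pattern(200000, plr=0.10)
print(sum(pattern) / len(pattern), mean_burst_length(pattern))
```

For a long pattern, the measured loss rate and mean burst length should approach the requested 10% and 1.5 packets.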

Figure 8 compares the WPESQ scores (in MOS) of decoded speech processed using the different speech streaming methods under different PLR conditions. As shown in this figure, the WPESQ score of the proposed method was better than those of the conventional methods. Specifically, the proposed method improved the average WPESQ score by as much as 0.55 and 0.2 MOS, compared to the decoder-based PLC method and the RST method, respectively.

Figure 8

Comparison of WPESQ scores measured in MOS for three different speech streaming methods under burst PLR conditions ranging from 0 to 25%.

5. Conclusion

In this paper, an adaptive speech streaming method was proposed to improve the speech quality of voice over wireless multimedia sensor networks (WMSNs). To this end, the proposed method first classified each frame of the input speech signal as either an onset frame or a nononset frame. Next, it estimated the perceived speech quality (PSQ) of the received speech data under packet loss conditions. On the basis of the estimated PSQ and the speech class, the proposed method determined an appropriate bit rate for the redundant speech data (RSD) that were transmitted with the primary speech data (PSD) to assist the speech decoder in reconstructing the speech signals of lost frames. In particular, when the estimated packet loss rate (PLR) was high, an artificial bandwidth extension technique was applied to the decoded narrowband speech. The effectiveness of the proposed method was demonstrated by implementing a voice over WMSN system employing the proposed method. A performance evaluation indicated that the proposed method significantly improved the decoded speech quality relative to the conventional methods under different PLR conditions ranging from 0% to 25%.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the government of Korea (MSIP) (no. 2015R1A2A1A05001687), by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2009-0093828), and by the MSIP, Korea, under the ITRC (Information Technology Research Center) support program (IITP-2015-H8501-15-1016) supervised by the IITP (Institute for Information & Communications Technology Promotion).

References

1. Almalkawi I. T., Zapata M. G., Al-Karaki J. N., Morillo-Pozo. Wireless multimedia sensor networks: current trends and future directions. Sensors, vol. 10, no. 7, pp. 6662–6717, 2010. doi:10.3390/s100706662

2. Akyildiz I. F., Melodia, Chowdhury K. R. A survey on wireless multimedia sensor networks. Computer Networks, vol. 51, no. 4, pp. 921–960, 2007. doi:10.1016/j.comnet.2006.10.002

3. Melodia, Akyildiz I. F. Cross-layer QoS-aware communication for ultra wide band wireless multimedia sensor networks. IEEE Journal on Selected Areas in Communications, vol. 28, no. 5, pp. 653–663, 2010. doi:10.1109/jsac.2010.100604

4. Kang J. A., Kim H. K. Adaptive redundant speech transmission over wireless multimedia sensor networks based on estimation of perceived speech quality. Sensors, vol. 11, no. 9, pp. 8469–8484, 2011. doi:10.3390/s110908469

5. Mangharam, Rowe, Rajkumar, Suzuki. Voice over sensor networks. Proceedings of the 27th IEEE International Real-Time Systems Symposium (RTSS ′06), Rio de Janeiro, Brazil, December 2006, pp. 291–300. doi:10.1109/rtss.2006.51

6. Park N. I., Kang J. A., Lee S. R., Kim H. K. A packet loss concealment technique improving quality of service for wideband speech coding in wireless sensor networks. International Journal of Distributed Sensor Networks, vol. 2014, Article ID 852798, 8 pages, 2014. doi:10.1155/2014/852798

7. Brunelli, Maggiorotti, Benini, Bellifemine F. L. Analysis of audio streaming capability of Zigbee networks. Wireless Sensor Networks: 5th European Conference, EWSN 2008, Bologna, Italy, January 30–February 1, 2008. Proceedings, Lecture Notes in Computer Science, vol. 4913, Springer, Berlin, Germany, pp. 189–204, 2008. doi:10.1007/978-3-540-77690-1_12

8. Merazka. Improved packet loss recovery using interleaving for CELP-type speech coders in packet networks. IAENG International Journal of Computer Science, vol. 36, no. 1, 5, 2009.

9. Lizhong, Muqing, Lulu, Mojia. An adaptive forward error control method for voice communication. Proceedings of the 2nd International Conference on Networking and Digital Society (ICNDS ′10), Wenzhou, China, May 2010, pp. 186–189. doi:10.1109/icnds.2010.5479338

10. Kouvelas, Hodson, Hardman, Crowcroft. Redundancy control in real-time internet audio conferencing. Proceedings of the International Workshop on Audio-Visual Services over Packet Networks (AVSPN ′97), Aberdeen, Scotland, September 1997, pp. 195–201.

11. Park, Lim, Cho K.-R. A dynamic packet recovery mechanism for realtime service in mobile computing environments. ETRI Journal, vol. 25, no. 5, pp. 356–368, 2003. doi:10.4218/etrij.03.0102.0001

12. T.-Y., Guizani, Lee W.-T., Huang P.-C. An enhanced structure of layered forward error correction and interleaving for scalable video coding in wireless video delivery. IEEE Wireless Communications, vol. 20, no. 4, pp. 146–152, 2013. doi:10.1109/MWC.2013.6590062

13. 3GPP TS 06.11. Substitution and Muting of Lost Frames for Full Rate Speech Channels. 2000.

14. 3GPP TS 26.091. Mandatory Speech Codec Speech Processing Functions; AMR Speech Codec; Error Concealment of Lost Frames. 2010.

15. Park N. I., Kim H. K., Jung M. A., Lee S. R., Choi S. H. Burst packet loss concealment using multiple codebooks and comfort noise for CELP-type speech coders in wireless sensor networks. Sensors, vol. 11, no. 5, pp. 5323–5336, 2011. doi:10.3390/s110505323

16. Huang, Zhang. Recovery of lost speech segments using incremental subspace learning. ETRI Journal, vol. 34, no. 4, pp. 645–648, 2012. doi:10.4218/etrij.12.0211.0408

17. Kang J. A., Kim H. K., Choi S. H., Kim S. R. Adaptive redundant speech streaming with scalable speech coding based on speech quality estimation. Information, vol. 17, no. 5, pp. 1921–1932, 2014.

18. Park N. I., Lee Y. H., Kim H. K. Artificial bandwidth extension of narrowband speech signals for the improvement of perceptual speech communication quality. Communications in Computer and Information Science, vol. 266, no. 2, pp. 143–153, 2011. doi:10.1007/978-3-642-27201-1_17

19. Sollaud. RTP payload format for the G.729.1 audio codec. IETF RFC 4749, 2006.

20. IETF RFC 5459. G.729.1 RTP Payload Format Update: Discontinuous Transmission (DTX) Support. 2009.

21. Ragot, Kövesi, Trilling, Virette, Duc, Massaloux, Proust, Geiser, Gartner, Schandl, Taddei, Gao, Shlomot, Ehara, Yoshida, Vaillancourt, Salami, Lee M. S., Kim D. Y. ITU-T G.729.1: an 8-32 kbit/s scalable coder interoperable with G.729 for wideband telephony and voice over IP. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ′07), Honolulu, Hawaii, USA, April 2007, pp. IV529–IV532. doi:10.1109/icassp.2007.366966

22. Malfait, Berger, Kastner. P.563-The ITU-T standard for single-ended speech quality assessment. IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 1924–1934, 2006. doi:10.1109/tasl.2006.883177

23. Gao, Shlomot, Benyassine, Thyssen, Murgia. The SMV algorithm selected by TIA and 3GPP2 for CDMA applications. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ′01), Salt Lake City, Utah, USA, May 2001, pp. 709–712.

24. NTT-AT. Multi-Lingual Speech Database for Telephonometry. 1994.

25. ITU-T Recommendation G.191. Software Tools for Speech and Audio Coding Standardization. 2010.

26. International Telecommunication Union (ITU). ITU-T Recommendation P.862.2, Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs. 2008.

27. Khan, Peng, Steinbach, Sgroi, Kellerer. Application-driven cross-layer optimization for video streaming over wireless networks. IEEE Communications Magazine, vol. 44, no. 1, pp. 122–130, 2006. doi:10.1109/MCOM.2006.1580942

28. Nagano, Ito. Packet loss concealment of voice-over IP packet using redundant parameter transmission under severe loss conditions. Journal of Information Hiding and Multimedia Signal Processing, vol. 5, no. 2, pp. 286–295, 2014.