A speech compression method without utilizing signal prediction

Abstract

Previous practical speech compression methods have been based on signal prediction, taking auditory functions into account but overlooking features specific to speech signals. A new method was developed in which amplitude envelopes in four frequency bands, corresponding to spectral factors common to different languages, are used to modulate infinitely peak-clipped signals, which have also been shown to contain useful linguistic information. In a pilot experiment, intelligibility reached about 80% at only 2,400 bits per second (bps), whereas the bit rate of the original signal was 256,000 bps. The algorithm preserves the naturalness of speech and is easy to grasp intuitively.

Keywords

speech signal, low-bit-rate speech coding, vocoding, intelligibility, periodicity, peak-clipping, amplitude envelope

How to cite this article

Matsuo, I., Ueda, K., & Nakajima, Y. (2025). A speech compression method without utilizing signal prediction. i–Perception, 16(3), 1–5. https://doi.org/10.1177/20416695251340236

Signal prediction techniques, such as linear predictive coding (LPC; e.g., Kohler, 1997; Schroeder & Atal, 1985), have been widely used to encode speech waveforms at reduced bit rates. However, even state-of-the-art code-excited linear prediction (CELP) requires more than 4,000 bits per second (bps), and developing a speech encoding technique that requires less than 4,000 bps has therefore been considered challenging (McCree, 2008). We propose an algorithm called 1-bit vocoding that conveys speech at bit rates lower than 4,000 bps while preserving intelligibility and naturalness.

Amplitude envelopes in the four frequency bands common to several different languages (Ueda & Nakajima, 2017) provide enough information to make speech signals reasonably intelligible (Ellermeier et al., 2015). Figure 1(a) illustrates amplitude-modulated noises in these four frequency bands. Sound 1 (800 bps) in online Movie 1 (Japanese) and online Movie 2 (English) demonstrates that combined amplitude-modulated noises are intelligible (Shannon et al., 1995; Ueda et al., 2018). However, this procedure, noise-vocoding, makes the speech sound very unnatural. To recover naturalness, we adopt infinitely peak-clipped speech instead of noise, as it conveys some information about vowels and consonants (Licklider & Pollack, 1948; Sakai & Inoue, 1960) [Figure 1(b)]. Naturalness is recovered, as in Sound 2 (2,400 bps; online Movies 1 and 2), and this occurs without any listener training.
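
One property of infinite peak clipping that is easy to check is that it keeps the sign of every sample, so the timing of zero crossings, and with it the coarse periodicity of a voiced sound, is unchanged. The following Python sketch illustrates this on a synthetic decaying tone, which is only a hypothetical stand-in for a voiced speech segment. The function names, the 125 Hz pitch, and the phase offset are illustrative assumptions, not details taken from the article.

```python
import math

def infinite_peak_clip(samples):
    """Keep only the sign of each sample: +1 if positive, -1 otherwise."""
    return [1.0 if s > 0 else -1.0 for s in samples]

def rising_zero_crossings(samples):
    """Indices at which the signal goes from non-positive to positive."""
    return [i for i in range(1, len(samples)) if samples[i - 1] <= 0 < samples[i]]

fs = 16000   # Hz; sampling rate of the recordings used in the article
f0 = 125     # Hz; hypothetical voice pitch (assumption)

def voiced_like(n):
    """Decaying two-harmonic tone; the 0.5 rad phase offset is arbitrary."""
    theta = 2 * math.pi * f0 * n / fs + 0.5
    return math.exp(-3 * n / fs) * (math.sin(theta) + 0.3 * math.sin(2 * theta))

tone = [voiced_like(n) for n in range(1600)]   # 100 ms of signal
clipped = infinite_peak_clip(tone)

crossings = rising_zero_crossings(clipped)
periods = {b - a for a, b in zip(crossings, crossings[1:])}

# Clipping removes the amplitude decay but leaves the crossing times intact.
print(crossings == rising_zero_crossings(tone))
print(periods, fs / min(periods))
```

In this toy case the clipped signal has lost all amplitude information, yet the spacing of its rising zero crossings still gives the 125 Hz pitch. Real speech is far less regular, so this should be read as an illustration of the principle only.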

Figure 1.

Narrowband spectrograms showing portions of the Japanese speech stimulus samples and audio demonstration materials (in Japanese in online Movie 1 and in English in online Movie 2). (a) A noise-vocoded speech stimulus (Sound 1, 800 bps), (b) a proposed 1-bit vocoded speech stimulus (Sound 2, 2,400 bps), (c) a proposed 1-bit vocoded speech stimulus (Sound 3, 4,800 bps), (d) the original speech sample (Sound 4, 256,000 bps). The amplitude envelopes of the noise-vocoded and the 1-bit vocoded speech stimuli were sampled at a 25 Hz sampling rate and quantized at 8 bits.

As shown in Figure 2, the encoding process consists of two sets of devices: the upper set extracts band-pass filtered amplitude envelopes, and the lower set extracts coarse periodicity information. In the upper set, a speech signal is divided into four frequency bands, and an amplitude envelope is extracted in each band with the Hilbert transformation (HT). The extracted amplitude envelopes are quantized at 6, 8, or 16 bits and sampled at a rate of 25 or 40 Hz. The bit rate for the four amplitude envelopes can thus be as low as 600 bps in total (6 bits × 25 Hz × 4 frequency bands). In the lower set, the coarse periodicity is conveyed with a 1-bit transformation, that is, by converting positive samples in the waveform into 1 and all other samples into 0 at a sampling rate of 800, 1600, 2000, 4000, or 16,000 Hz. This adds minimal processing cost and information relative to the original noise-vocoding. Because each sample takes 1 bit, the bit rate for this coarse periodicity signal equals its sampling rate. The total bit rate of the 1-bit vocoding technique is the sum of the bit rates for the amplitude envelopes and the coarse periodicity. At the decoding stage, smoothed amplitude envelopes are multiplied by the filtered 1-bit transformed signals.
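
The encoding and decoding steps described above can be sketched in Python with NumPy. This is a minimal illustration under several assumptions that the article does not specify: brick-wall FFT band-pass filters, frame averaging as the envelope smoothing before sampling, input scaled to the range −1 to 1 so that envelopes can be quantized against a fixed full scale, a zero-order hold to restore the 1-bit signal to the original sampling rate, and linear interpolation of the decoded envelopes. All function names are hypothetical; only the band edges and the default parameters (8 bits, 25 Hz, 1600 Hz) come from the text.

```python
import numpy as np

# Band edges in Hz, as reported for the pilot experiment
BANDS_HZ = [(20, 510), (510, 1270), (1270, 2700), (2700, 6400)]

def bandpass(x, fs, lo, hi):
    """Brick-wall band-pass filter via the FFT (assumed filter design)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    spec[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def hilbert_envelope(x):
    """Amplitude envelope: magnitude of the FFT-based analytic signal."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    gain[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))

def encode(x, fs, bits=8, env_rate=25, clip_rate=1600):
    """Return per-band envelope codes and the 1-bit periodicity stream."""
    env_step, clip_step = fs // env_rate, fs // clip_rate
    n_frames = len(x) // env_step
    levels = 2 ** bits - 1
    codes = []
    for lo, hi in BANDS_HZ:
        env = hilbert_envelope(bandpass(x, fs, lo, hi))
        # Average each frame (assumed smoothing), then quantize to `bits` bits
        frames = env[:n_frames * env_step].reshape(n_frames, env_step).mean(axis=1)
        codes.append(np.round(np.clip(frames, 0.0, 1.0) * levels).astype(int))
    # Band-limit before downsampling, then map positive samples to 1, others to 0
    coarse = bandpass(x, fs, 0.0, clip_rate / 2)[::clip_step]
    return codes, (coarse > 0).astype(np.uint8)

def decode(codes, one_bit, fs, n, bits=8, env_rate=25, clip_rate=1600):
    """Multiply interpolated envelopes by the band-filtered 1-bit signal."""
    env_step, clip_step = fs // env_rate, fs // clip_rate
    levels = 2 ** bits - 1
    carrier = np.repeat(one_bit * 2.0 - 1.0, clip_step)[:n]  # zero-order hold
    t = np.arange(n)
    out = np.zeros(n)
    for (lo, hi), code in zip(BANDS_HZ, codes):
        centers = (np.arange(len(code)) + 0.5) * env_step
        env = np.interp(t, centers, code / levels)
        out += env * bandpass(carrier, fs, lo, hi)
    return out

# Synthetic test signal (assumption): a 120 Hz buzz with a slow 3 Hz modulation
fs = 16000
t = np.arange(fs) / fs
buzz = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 20))
x = 0.4 * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t)) * buzz

codes, one_bit = encode(x, fs)
y = decode(codes, one_bit, fs, len(x))
print(len(one_bit), [len(c) for c in codes], y.shape)
```

With the default parameters, one second of input yields 4 × 25 eight-bit envelope codes and 1600 one-bit samples, which matches the 2,400 bps configuration. A real implementation would probably use better-behaved filters than the brick-wall ones assumed here.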

Figure 2.

A block diagram showing the encoding and decoding processes using the proposed 1-bit vocoder. The encoding process consists of two sets of devices, extracting amplitude envelopes with the upper set and coarse periodicity information with the lower set. A speech signal is divided into four frequency bands with band-pass filters (BPFs) in the upper set of the diagram. The amplitude envelopes are extracted with the HT, quantized at 6, 8, or 16 bits, and sampled at 25 or 40 Hz rates. The 1-bit transforming process is run in the lower set by converting the positive samples in the original waveform into “1” and all the other samples into “0”. Smoothed amplitude envelopes are multiplied by the filtered 1-bit transformed signals at the decoding stage.

For a pilot experiment, 30 everyday sentences in Japanese spoken by a male speaker were selected from the NTT-AT Multilingual Speech Database 2002 (NTT-AT, Kawasaki, Japan; recorded with a 16 kHz sampling rate and 16-bit linear quantization). All stimuli were produced with the 1-bit vocoding technique. Each sentence was presented only once to each listener. The results from 12 native Japanese listeners showed that an intelligibility of about 80% could be attained even for speech stimuli compressed to 2,400 bps [Figure 3(b), at the 1600 Hz sampling frequency]. For Sound 2 [as in Figure 1(b), in Japanese in online Movie 1 and in English in online Movie 2], amplitude envelopes were sampled at a 25 Hz rate and quantized at 8 bits for four frequency bands [20–510, 510–1270, 1270–2700, and 2700–6400 Hz, based on Ueda and Nakajima (2017)], and coarse periodicity was obtained at a sampling rate of 1600 Hz.
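
The bit rates quoted for the stimuli follow from simple arithmetic on these parameters. The short Python sketch below reproduces them; the function names are hypothetical, and the calculation assumes that no side information beyond the envelope codes and the 1-bit samples is transmitted.

```python
def envelope_bit_rate(bits, env_rate_hz, n_bands=4):
    """Bits per second needed for the quantized amplitude envelopes."""
    return bits * env_rate_hz * n_bands

def total_bit_rate(bits, env_rate_hz, periodicity_rate_hz, n_bands=4):
    """Envelope bit rate plus 1 bit per sample of the coarse periodicity signal."""
    return envelope_bit_rate(bits, env_rate_hz, n_bands) + periodicity_rate_hz

# Sound 2: 8-bit envelopes at 25 Hz in four bands, periodicity at 1600 Hz
sound2 = total_bit_rate(8, 25, 1600)

# Original recording: 16 kHz sampling rate, 16-bit linear quantization
original = 16_000 * 16

print(sound2, original, round(original / sound2, 1))
```

For Sound 2 this gives 800 bps for the envelopes plus 1,600 bps for the periodicity signal, that is, 2,400 bps against 256,000 bps for the original recording.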

Figure 3.

Results of the pilot experiment showing mora (a unit in Japanese shorter than or equal to a syllable) accuracy with 12 Japanese native participants. (a–c) Mean percentages of mora accuracy as functions of sampling frequency. Bit depth: (a) 16 bits, (b) 8 bits, and (c) 6 bits. The sampling rates for amplitude envelopes were 25 (white bars) and 40 Hz (black bars). (d) Mora accuracy is indicated as a function of the bit rate of encoding. The circles and triangles represent 25 and 40 Hz rates for amplitude envelope sampling. The black, blue, and red curves show 16, 8, and 6 bits for quantization. Error bars reflect standard errors of the mean (SEMs).

Speech signals severely degraded by removing periodicity and reducing frequency resolution are hard to hear (Sound 1 in online Movies 1 and 2). Acoustic signals resynthesized without periodicity sound very unnatural, especially on first listening. Adding coarse (low-bit-rate) periodicity cues recovers naturalness (Sound 2 in online Movies 1 and 2) and improves speech quality. This demonstration indicates that temporal periodicity, as observed in infinitely peak-clipped speech (Licklider & Pollack, 1948), plays a role different from that of the amplitude envelopes in the four frequency bands corresponding to spectral factors of speech (Ueda & Nakajima, 2017). This leads to an intuitively simple speech compression algorithm that requires bit rates well below 4,000 bps while enabling reasonable communication.

Footnotes

ORCID iDs

Ikuo Matsuo

Kazuo Ueda

Acknowledgments

The authors thank Asahiro Ito for running the experiment, and an anonymous reviewer for providing insightful comments on the draft.

Author Contributions

Ikuo Matsuo: Conceptualization; Data curation; Investigation; Software; Visualization; Writing – original draft; Writing – review & editing.

Kazuo Ueda: Formal analysis; Methodology; Project administration; Validation; Writing – review & editing.

Yoshitaka Nakajima: Conceptualization; Methodology; Supervision; Writing – review & editing.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by JSPS KAKENHI Grant Numbers JP25242002, JP17H06197, JP17K18705, and JP19H00630.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

References

Ellermeier, Kattner, Ueda, Doumoto, & Nakajima (2015). Memory disruption by irrelevant noise-vocoded speech: Effects of native language and the number of frequency bands. The Journal of the Acoustical Society of America, 138, 1561–1569. https://doi.org/10.1121/1.4928954

Kohler, M. A. (1997). A comparison of the new 2400 bps MELP federal standard with other standard coders. In 1997 IEEE international conference on acoustics speech and signal processing (pp. 1587–1590). IEEE. https://doi.org/10.1109/ICASSP.1997.596256

Licklider, J. C. R., & Pollack (1948). Effects of differentiation, integration, and infinite peak clipping upon the intelligibility of speech. The Journal of the Acoustical Society of America, 20, 42–51. https://doi.org/10.1121/1.1906346

McCree, A. V. (2008). Low-bit-rate speech coding. In Benesty, Sondhi, M. M., & Huang, Y. A. (Eds.), Springer handbooks of speech processing (pp. 331–350). Springer.

Sakai, & Inoue (1960). New instruments and methods for speech analysis. The Journal of the Acoustical Society of America, 32, 441–450. https://doi.org/10.1121/1.1908093

Schroeder, M. R., & Atal, B. S. (1985). Code excited linear prediction (CELP): High quality speech at very low bit rates. In ICASSP ’85. IEEE international conference on acoustics speech and signal processing (pp. 937–940). IEEE. https://doi.org/10.1109/ICASSP.1985.1168147

Shannon, R. V., Zeng, F. G., Kamath, Wygonski, & Ekelid (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304. https://doi.org/10.1126/science.270.5234.303

Ueda, Araki, & Nakajima (2018). Frequency specificity of amplitude envelope patterns in noise-vocoded speech. Hearing Research, 367, 169–181. https://doi.org/10.1016/j.heares.2018.06.005

Ueda, & Nakajima (2017). An acoustic key to eight languages/dialects: Factor analyses of critical-band-filtered speech. Scientific Reports, 7, 1–4. https://doi.org/10.1038/srep42468
