Abstract
Text-to-Speech (TTS) technology converts text into human-like speech, aiding the visually impaired, powering voice assistants, and enabling automated news broadcasting. This study proposes P VALL-E, an efficient speech synthesis system that enhances Microsoft's VALL-E model by replacing its Transformer architecture with Performer structures through a layer-wise strategy. The proposed mechanism improves processing speed for long texts and reduces the parameter count, making the model suitable for resource-constrained environments. Pre-trained on English and Simplified Chinese speech data, P VALL-E leverages multilingual training to transfer knowledge to less-resourced Taiwanese speech data, improving performance across languages. A language embedding mechanism is also incorporated for accent control and personalized synthesis. Experimental results show that P VALL-E matches the original VALL-E in accuracy while boosting generation speed by approximately 20%. Even with limited data, the proposed architecture performs well in multilingual settings. Additionally, an Android app was developed; because of the model's high computational requirements, it runs the model on a server and transmits the synthesized results to users' devices.
