Abstract
Text-to-Speech (TTS) technology converts text into human-like speech, aiding the visually impaired, powering voice assistants, and enabling automated news broadcasting. This study proposes P VALL-E, an efficient speech synthesis system that enhances Microsoft's VALL-E model by replacing its Transformer architecture with Performer structures through a layer-wise strategy. The proposed mechanism improves processing speed for long texts and reduces the parameter count, making the model suitable for resource-constrained environments. Pre-trained on English and Simplified Chinese speech data, P VALL-E leverages multilingual training to transfer knowledge to less-resourced Taiwanese speech data, improving performance across languages. A language embedding mechanism is also incorporated for accent control and personalized synthesis. Experimental results show that P VALL-E matches the original VALL-E in accuracy while boosting generation speed by approximately 20%. Even with limited data, the proposed architecture performs well in multilingual settings. Additionally, an Android app was developed; because of the model's high computational requirements, it runs the model on a server and transmits the synthesized results to users' devices.
