Abstract
The recently introduced Convolution-augmented Transformer (Conformer) has achieved state-of-the-art (SOTA) results in Automatic Speech Recognition (ASR). In this paper, a series of systematic investigations reveals that the Conformer's design choices may not be the most efficient under a limited computational budget. After thoroughly re-evaluating these design choices, we propose Sampleformer, which reduces the Conformer architecture's complexity while delivering more robust performance. We introduce downsampling into the Conformer encoder and, to better exploit the information in the speech features, incorporate an additional downsampling module that improves both the efficiency and accuracy of our model. Additionally, we propose a novel and adaptable attention mechanism, multi-group attention, which effectively reduces the attention complexity from
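The exact multi-group attention formulation is not given in this excerpt. As one plausible interpretation, a minimal sketch of grouped attention is shown below, assuming query heads are partitioned into groups that share key/value projections (as in grouped-query attention); all function and variable names here are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(x, wq, wk, wv, n_heads, n_groups):
    """Hypothetical grouped-attention sketch: n_heads query heads
    share n_groups key/value heads (n_heads % n_groups == 0),
    shrinking the K/V projections by a factor n_heads / n_groups."""
    T, d = x.shape
    hd = d // n_heads                      # per-head dimension
    q = (x @ wq).reshape(T, n_heads, hd)   # (T, H, hd)
    k = (x @ wk).reshape(T, n_groups, hd)  # (T, G, hd)
    v = (x @ wv).reshape(T, n_groups, hd)
    rep = n_heads // n_groups
    k = np.repeat(k, rep, axis=1)          # map each group's K/V
    v = np.repeat(v, rep, axis=1)          # onto its query heads
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    attn = softmax(scores, axis=-1)        # attention over time steps
    out = np.einsum('hts,shd->thd', attn, v)
    return out.reshape(T, d)

rng = np.random.default_rng(0)
T, d, H, G = 6, 8, 4, 2
x = rng.standard_normal((T, d))
wq = rng.standard_normal((d, d))
wk = rng.standard_normal((d, d // (H // G)))  # K/V project to G*hd dims
wv = rng.standard_normal((d, d // (H // G)))
y = grouped_attention(x, wq, wk, wv, H, G)
print(y.shape)  # (6, 8)
```

With G < H the key/value projection matrices and cached K/V tensors shrink proportionally, which is one common route to reducing attention cost without changing the output dimensionality.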
