Abstract
The recently introduced Convolution-augmented Transformer (Conformer) has achieved state-of-the-art (SOTA) results in Automatic Speech Recognition (ASR). In this paper, a series of systematic investigations reveals that the Conformer's design choices may not be the most efficient under a limited computational budget. After thoroughly re-evaluating these design choices, we propose Sampleformer, which reduces the Conformer architecture's complexity while delivering more robust performance. We introduce downsampling into the Conformer encoder and, to better exploit the information in the speech features, incorporate an additional downsampling module that improves both the efficiency and accuracy of our model. Additionally, we propose a novel and adaptable attention mechanism, multi-group attention, which effectively reduces the attention complexity from
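The exact multi-group attention formulation is not given in this excerpt. As one plausible interpretation, a minimal sketch of grouped attention is shown below, assuming query heads are partitioned into groups that share key/value projections (as in grouped-query attention); all function and variable names here are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(x, wq, wk, wv, n_heads, n_groups):
    """Hypothetical grouped-attention sketch: n_heads query heads
    share n_groups key/value heads (n_heads % n_groups == 0),
    shrinking the K/V projections by a factor n_heads / n_groups."""
    T, d = x.shape
    hd = d // n_heads                      # per-head dimension
    q = (x @ wq).reshape(T, n_heads, hd)   # (T, H, hd)
    k = (x @ wk).reshape(T, n_groups, hd)  # (T, G, hd)
    v = (x @ wv).reshape(T, n_groups, hd)
    rep = n_heads // n_groups
    k = np.repeat(k, rep, axis=1)          # map each group's K/V
    v = np.repeat(v, rep, axis=1)          # onto its query heads
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    attn = softmax(scores, axis=-1)        # attention over time steps
    out = np.einsum('hts,shd->thd', attn, v)
    return out.reshape(T, d)

rng = np.random.default_rng(0)
T, d, H, G = 6, 8, 4, 2
x = rng.standard_normal((T, d))
wq = rng.standard_normal((d, d))
wk = rng.standard_normal((d, d // (H // G)))  # K/V project to G*hd dims
wv = rng.standard_normal((d, d // (H // G)))
y = grouped_attention(x, wq, wk, wv, H, G)
print(y.shape)  # (6, 8)
```

With G < H the key/value projection matrices and cached K/V tensors shrink proportionally, which is one common route to reducing attention cost without changing the output dimensionality.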
