Abstract
Pretrained transformer models have demonstrated excellent performance on complex tasks. To improve their inference efficiency, recent studies have introduced the multi-exit mechanism, which enables early exiting through multiple intermediate classifiers. However, the deep architectures of pretrained transformers cause severe gradient conflicts during multi-exit fine-tuning, leading to degraded shallow-exit accuracy and reduced early-exit efficiency. To address this issue, we propose Separate Reverse, a multi-exit training strategy specifically designed for pretrained transformer models. The method integrates reverse iterative optimization with hierarchical knowledge distillation from deeper to shallower exits; it preserves the integrity of the pretrained parameters, enhances the representation capacity of shallow exits, and coordinates gradient updates across exits to balance optimization between shallow and deep classifiers. Experiments with BERT on multiple GLUE benchmark datasets demonstrate that our method significantly improves shallow-exit accuracy, maintains main-exit performance, and substantially accelerates inference on simple samples.
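The abstract gives no implementation details, but as a minimal, non-authoritative sketch of what deeper-to-shallower training with hierarchical distillation could look like, the PyTorch snippet below walks over the exits from deepest to shallowest and distills each shallower exit from its next deeper neighbor. Every name here (reverse_exit_training_step, exit_logits, alpha, temperature) is a hypothetical placeholder, not taken from the paper.

```python
# Hypothetical sketch of a deeper-to-shallower multi-exit training loss with
# hierarchical knowledge distillation. Names and structure are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn.functional as F

def reverse_exit_training_step(exit_logits, labels, alpha=0.5, temperature=2.0):
    """Compute a combined loss for exits ordered shallow -> deep.

    exit_logits: list of [batch, num_classes] tensors, one per exit.
    Each shallower exit is supervised by the task labels and additionally
    distilled from the logits of the next deeper exit (treated as teacher).
    """
    losses = []
    # Walk from the deepest exit toward the shallowest.
    for i in reversed(range(len(exit_logits))):
        task_loss = F.cross_entropy(exit_logits[i], labels)
        if i < len(exit_logits) - 1:
            # Hierarchical distillation: the next deeper exit acts as teacher.
            teacher = exit_logits[i + 1].detach()
            kd_loss = F.kl_div(
                F.log_softmax(exit_logits[i] / temperature, dim=-1),
                F.softmax(teacher / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            losses.append((1 - alpha) * task_loss + alpha * kd_loss)
        else:
            losses.append(task_loss)  # deepest exit: plain task loss
    return torch.stack(losses).sum()

# Minimal usage with random tensors standing in for real exit outputs:
if __name__ == "__main__":
    torch.manual_seed(0)
    logits = [torch.randn(8, 2, requires_grad=True) for _ in range(4)]
    labels = torch.randint(0, 2, (8,))
    loss = reverse_exit_training_step(logits, labels)
    loss.backward()
    print(f"total loss: {loss.item():.4f}")
```

Detaching the deeper exit's logits confines each distillation gradient to the shallower student, which is one plausible way to reduce cross-exit gradient conflicts while the deeper exits remain supervised only by the task loss.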
