Abstract
Amid the rapid development of artificial intelligence, multimodal large models (VLMs) have driven a new wave of technological progress, building on revolutionary breakthroughs in natural language processing (NLP). On-device multimodal large models (on-device VLMs) are emerging as a focus of innovation owing to their fast pace of development and broad application prospects, and demand for on-device inference is growing accordingly. This study adapts and optimizes multimodal large models for the Neural Network Processing Unit (NPU) on the Qualcomm platform. Using the QNN (Qualcomm Neural Network) framework, model compression techniques such as quantization, pruning, knowledge distillation, and low-rank factorization, and Lookahead decoding, we achieve efficient inference acceleration on the Qualcomm NPU, significantly improving the model's response speed and decoding efficiency. Experimental results show that the optimized model achieves significant improvements in first response time and decoding speed, offering a new solution for on-device AI applications.
