Abstract
Amid the rapid development of artificial intelligence, multimodal large models (VLMs) have driven a new wave of technological progress, building on revolutionary breakthroughs in natural language processing (NLP). On-device multimodal large models (on-device VLMs) are emerging as a focus of innovation owing to their fast pace of development and broad application prospects, and demand for on-device inference is growing accordingly. This study adapts and optimizes multimodal large models for the Neural Network Processing Unit (NPU) on the Qualcomm platform. Using the QNN (Qualcomm Neural Network) framework, model compression techniques such as quantization, pruning, knowledge distillation, and low-rank factorization, and Lookahead decoding, we achieve efficient inference acceleration on the Qualcomm NPU, significantly improving the model's response speed and decoding efficiency. Experimental results show that the optimized model achieves significant improvements in first response time and decoding speed, offering a new solution for on-device AI applications.
