CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

arXiv cs.AI / 5/5/2026


Key Points

  • CoVSpec is an efficient device-edge co-inference framework for vision-language model (VLM) deployment: a lightweight “draft” VLM on the mobile device collaborates with a larger “target” VLM on the edge server via speculative decoding (sketched after this list).
  • The framework addresses the inefficiency of speculative decoding in VLMs by pruning redundant visual tokens on-device with a training-free method that jointly weighs query relevance, token activity, and low-rank dependency (an illustrative scoring sketch follows the abstract).
  • CoVSpec improves efficiency further with an adaptive drafting strategy that dynamically tunes verification frequency and draft length to match runtime conditions.
  • It also proposes a parallel branching mechanism with decoupled verification-correction to better utilize draft-side computation during target-side verification while cutting correction-related communication.
  • Experiments on multiple benchmarks report up to 2.21× higher throughput than target-only inference and over 96% communication overhead reduction without sacrificing accuracy.
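
To make the draft-verify interplay concrete, here is a minimal, runnable Python sketch of a speculative decoding loop with a toy adaptive draft-length rule. All model calls are random stubs, and every name (`draft_next_token`, `target_accept_prob`, `verify_round`) is hypothetical; the acceptance rule shown is the standard speculative-decoding one, while CoVSpec's actual drafting and verification policies are not detailed in this summary.

```python
# Sketch of device-edge speculative decoding: the "draft" functions
# stand in for the small on-device VLM, the "target" functions for the
# large edge-server VLM. All stubs return random values.
import random

random.seed(0)
VOCAB = list(range(100))

def draft_next_token(ctx):
    """Stub for the on-device draft VLM: one token plus its draft prob."""
    return random.choice(VOCAB), 0.8

def target_accept_prob(ctx, tok, draft_p):
    """Stub for edge-side verification. Standard speculative decoding
    accepts a drafted token with probability min(1, p_target / p_draft)."""
    p_target = random.uniform(0.0, 1.0)
    return min(1.0, p_target / draft_p)

def target_sample(ctx):
    """Stub: the target model samples a correction token on rejection."""
    return random.choice(VOCAB)

def verify_round(ctx, draft_len):
    """Draft `draft_len` tokens on-device, then verify them left to
    right on the edge; keep the accepted prefix and replace the first
    rejected token with the target's correction."""
    proposals = []
    for _ in range(draft_len):
        tok, p = draft_next_token(ctx + [t for t, _ in proposals])
        proposals.append((tok, p))
    out, n_accepted = [], 0
    for tok, p in proposals:
        if random.random() < target_accept_prob(ctx + out, tok, p):
            out.append(tok)
            n_accepted += 1
        else:
            out.append(target_sample(ctx + out))  # target's correction
            break
    return ctx + out, n_accepted

ctx, draft_len = [1, 2, 3], 4   # hypothetical prompt ids, initial length
for _ in range(6):
    ctx, n_ok = verify_round(ctx, draft_len)
    # Toy adaptive drafting: lengthen drafts when everything is accepted,
    # shorten after early rejections (an illustrative stand-in for
    # CoVSpec's adaptive verification-frequency / draft-length policy).
    draft_len = min(8, draft_len + 1) if n_ok == draft_len else max(2, draft_len - 1)
print(f"generated {len(ctx) - 3} tokens")
```

Longer accepted runs mean fewer verification round-trips to the edge, which is where the throughput and communication savings come from.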

Abstract

Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A practical alternative is device-edge co-inference, where a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server via speculative decoding. Nevertheless, directly extending speculative decoding to VLMs suffers from severe inefficiency due to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance, token activity, and low-rank dependency. Moreover, we design an adaptive drafting strategy that dynamically adjusts both the verification frequency and the draft length. In addition, we introduce a parallel branching mechanism with decoupled verification-correction to improve draft-side utilization during target-side verification and reduce correction-related transmission overhead. Experiments on multiple benchmarks show that CoVSpec achieves up to 2.21× higher throughput than target-only inference and reduces communication overhead by more than 96% compared with baselines, without compromising task accuracy.
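
The abstract names the three pruning signals but not their formulas, so the following NumPy sketch is purely illustrative: one plausible way to score visual tokens by query relevance (attention-style similarity to a pooled query embedding), token activity (embedding norm), and low-rank dependency (residual after a rank-R SVD reconstruction). The equal weighting, the rank R, and the 50% keep ratio are all arbitrary assumptions, not the paper's method.

```python
# Hypothetical scoring sketch for training-free visual-token pruning.
# Every formula below is an illustrative guess at the three signals the
# paper names: query relevance, token activity, low-rank dependency.
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 64, 32, 8               # visual tokens, hidden dim, rank
V = rng.standard_normal((N, D))   # visual token embeddings (stand-in)
q = rng.standard_normal(D)        # pooled text-query embedding (stand-in)

# Query relevance: softmax-normalized similarity of each visual token
# to the query, in the style of an attention score.
rel = np.exp(V @ q / np.sqrt(D))
rel /= rel.sum()

# Token activity: L2 norm of each token embedding, normalized.
act = np.linalg.norm(V, axis=1)
act /= act.sum()

# Low-rank dependency: tokens well reconstructed from the top-R
# principal components are largely redundant, so keep tokens with a
# large residual (low dependency on the shared subspace).
U, S, Vt = np.linalg.svd(V, full_matrices=False)
V_lowrank = U[:, :R] @ np.diag(S[:R]) @ Vt[:R]
resid = np.linalg.norm(V - V_lowrank, axis=1)
resid /= resid.sum()

# Combine the signals (equal weights are an arbitrary choice) and keep
# the top half of the visual tokens before drafting on-device.
score = rel + act + resid
keep = np.argsort(score)[-N // 2:]
print("kept", len(keep), "of", N, "visual tokens")
```

Pruning on-device shrinks both the draft model's visual-token computation and the payload that must be sent to the edge for verification, consistent with the communication savings the paper reports.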