CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

arXiv cs.AI / 5/5/2026


Key Points

  • CoVSpec is an efficient device-edge co-inference framework for vision-language model (VLM) deployment: a lightweight “draft” VLM on the mobile device collaborates with a larger “target” VLM on the edge server via speculative decoding (sketched after this list).
  • The framework addresses the inefficiency of speculative decoding in VLMs by pruning redundant visual tokens on-device with a training-free method that jointly weighs query relevance, token activity, and low-rank dependency (an illustrative scoring sketch follows the abstract).
  • CoVSpec improves efficiency further with an adaptive drafting strategy that dynamically tunes verification frequency and draft length to match runtime conditions.
  • It also proposes a parallel branching mechanism with decoupled verification-correction to better utilize draft-side computation during target-side verification while cutting correction-related communication.
  • Experiments on multiple benchmarks report up to 2.21× higher throughput than target-only inference and over 96% communication overhead reduction without sacrificing accuracy.
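
To make the draft-verify interplay concrete, here is a minimal, runnable Python sketch of a speculative decoding loop with a toy adaptive draft-length rule. All model calls are random stubs, and every name (`draft_next_token`, `target_accept_prob`, `verify_round`) is hypothetical; the acceptance rule shown is the standard speculative-decoding one, while CoVSpec's actual drafting and verification policies are not detailed in this summary.

```python
# Sketch of device-edge speculative decoding: the "draft" functions
# stand in for the small on-device VLM, the "target" functions for the
# large edge-server VLM. All stubs return random values.
import random

random.seed(0)
VOCAB = list(range(100))

def draft_next_token(ctx):
    """Stub for the on-device draft VLM: one token plus its draft prob."""
    return random.choice(VOCAB), 0.8

def target_accept_prob(ctx, tok, draft_p):
    """Stub for edge-side verification. Standard speculative decoding
    accepts a drafted token with probability min(1, p_target / p_draft)."""
    p_target = random.uniform(0.0, 1.0)
    return min(1.0, p_target / draft_p)

def target_sample(ctx):
    """Stub: the target model samples a correction token on rejection."""
    return random.choice(VOCAB)

def verify_round(ctx, draft_len):
    """Draft `draft_len` tokens on-device, then verify them left to
    right on the edge; keep the accepted prefix and replace the first
    rejected token with the target's correction."""
    proposals = []
    for _ in range(draft_len):
        tok, p = draft_next_token(ctx + [t for t, _ in proposals])
        proposals.append((tok, p))
    out, n_accepted = [], 0
    for tok, p in proposals:
        if random.random() < target_accept_prob(ctx + out, tok, p):
            out.append(tok)
            n_accepted += 1
        else:
            out.append(target_sample(ctx + out))  # target's correction
            break
    return ctx + out, n_accepted

ctx, draft_len = [1, 2, 3], 4   # hypothetical prompt ids, initial length
for _ in range(6):
    ctx, n_ok = verify_round(ctx, draft_len)
    # Toy adaptive drafting: lengthen drafts when everything is accepted,
    # shorten after early rejections (an illustrative stand-in for
    # CoVSpec's adaptive verification-frequency / draft-length policy).
    draft_len = min(8, draft_len + 1) if n_ok == draft_len else max(2, draft_len - 1)
print(f"generated {len(ctx) - 3} tokens")
```

Longer accepted runs mean fewer verification round-trips to the edge, which is where the throughput and communication savings come from.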

Abstract

Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A practical alternative is device-edge co-inference, where a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server via speculative decoding. Nevertheless, directly extending speculative decoding to VLMs suffers from severe inefficiency due to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance, token activity, and low-rank dependency. Moreover, we design an adaptive drafting strategy that dynamically adjusts both the verification frequency and the draft length. In addition, we introduce a parallel branching mechanism with decoupled verification-correction to improve draft-side utilization during target-side verification and reduce correction-related transmission overhead. Experiments on multiple benchmarks show that CoVSpec achieves up to 2.21× higher throughput than target-only inference and reduces communication overhead by more than 96% compared with baselines, without compromising task accuracy.
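
The abstract names the three pruning signals but not their formulas, so the following NumPy sketch is purely illustrative: one plausible way to score visual tokens by query relevance (attention-style similarity to a pooled query embedding), token activity (embedding norm), and low-rank dependency (residual after a rank-R SVD reconstruction). The equal weighting, the rank R, and the 50% keep ratio are all arbitrary assumptions, not the paper's method.

```python
# Hypothetical scoring sketch for training-free visual-token pruning.
# Every formula below is an illustrative guess at the three signals the
# paper names: query relevance, token activity, low-rank dependency.
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 64, 32, 8               # visual tokens, hidden dim, rank
V = rng.standard_normal((N, D))   # visual token embeddings (stand-in)
q = rng.standard_normal(D)        # pooled text-query embedding (stand-in)

# Query relevance: softmax-normalized similarity of each visual token
# to the query, in the style of an attention score.
rel = np.exp(V @ q / np.sqrt(D))
rel /= rel.sum()

# Token activity: L2 norm of each token embedding, normalized.
act = np.linalg.norm(V, axis=1)
act /= act.sum()

# Low-rank dependency: tokens well reconstructed from the top-R
# principal components are largely redundant, so keep tokens with a
# large residual (low dependency on the shared subspace).
U, S, Vt = np.linalg.svd(V, full_matrices=False)
V_lowrank = U[:, :R] @ np.diag(S[:R]) @ Vt[:R]
resid = np.linalg.norm(V - V_lowrank, axis=1)
resid /= resid.sum()

# Combine the signals (equal weights are an arbitrary choice) and keep
# the top half of the visual tokens before drafting on-device.
score = rel + act + resid
keep = np.argsort(score)[-N // 2:]
print("kept", len(keep), "of", N, "visual tokens")
```

Pruning on-device shrinks both the draft model's visual-token computation and the payload that must be sent to the edge for verification, consistent with the communication savings the paper reports.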