See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

arXiv cs.CL / 4/8/2026


Key Points

  • The paper addresses high inference latency in video LLMs by proposing a training-free speculative decoding approach called LVSpec that uses a draft-and-verify paradigm tailored to Video-LLMs.
  • LVSpec relaxes speculative decoding constraints by using visual-semantic guidance to enforce strict verification only on visually relevant “anchor” tokens while allowing looser verification for visual-irrelevant filler tokens.
  • It introduces a lightweight visual-relevant token identification scheme to find those anchors and a position-shift-tolerant mechanism to accept semantically equivalent tokens even when their positions don’t match.
  • Experiments show LVSpec maintains very high fidelity (>99.8% of the target model's performance) while significantly speeding up generation, achieving 2.70x acceleration on Qwen2.5-VL-32B and 2.94x on LLaVA-OneVision-72B.
  • Compared with existing training-free speculative decoding methods for Video-LLMs, LVSpec increases mean accepted length by 136% and improves speedup ratio by 35%, indicating substantially better throughput gains without model retraining.
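
The verification rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `is_anchor` predicate (which the paper derives from visual-semantic guidance), the tolerance `window`, and all function names are assumptions made here for clarity.

```python
def loosely_verify(draft, target, is_anchor, window=2):
    """Return how many draft tokens are accepted under loose verification.

    draft:     token ids proposed by the draft model
    target:    token ids the target model would emit at each position
    is_anchor: hypothetical predicate marking visually relevant anchor tokens
    window:    position-shift tolerance for filler tokens (assumed value)
    """
    accepted = 0
    for i, tok in enumerate(draft):
        if is_anchor(tok):
            # Anchors mandate strictness: exact positional match required.
            if i >= len(target) or tok != target[i]:
                break
        else:
            # Fillers permit looseness: accept if the token appears among
            # target tokens within +/- window positions (position-shift
            # tolerant acceptance of semantically equivalent tokens).
            lo, hi = max(0, i - window), min(len(target), i + window + 1)
            if tok not in target[lo:hi]:
                break
        accepted += 1
    return accepted
```

For example, with `draft = [1, 2, 3, 4]`, `target = [1, 3, 2, 9]`, and only token `1` treated as an anchor, strict exact-match verification would accept just 1 token, while `loosely_verify` accepts 3: the swapped fillers `2` and `3` are salvaged by the tolerance window, which is exactly the mechanism behind the longer mean accepted length.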

Abstract

Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.