Small Vision-Language Models are Smart Compressors for Long Video Understanding

arXiv cs.CL / 4/10/2026


Key Points

  • The paper argues that adapting multimodal LLMs for hour-long videos is limited by context/token budgets and resulting fidelity loss, especially due to dense visual streams and “lost-in-the-middle” effects.
  • It proposes Tempo, a query-aware long-video compression framework that uses a Small Vision-Language Model (SVLM) as a local temporal compressor to produce compact, intent-aligned representations in a single forward pass.
  • Tempo introduces Adaptive Token Allocation (ATA), a training-free O(1) dynamic routing method that allocates more bandwidth to query-critical segments while compressing redundant parts into minimal temporal anchors without breaking causality.
  • Experiments report that a 6B Tempo model achieves state-of-the-art long-form video understanding, scoring 52.3 on the extreme-long LVBench (4101s) under a strict 8K visual budget and outperforming GPT-4o and Gemini 1.5 Pro.
  • The results suggest long-form video understanding should rely on intent-driven efficiency and structured compression rather than simply expanding/padding context windows.
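The allocation idea in the ATA bullet can be sketched as a simple budgeted routing rule: every segment keeps a minimal temporal anchor so the global storyline survives, and the remaining budget is spent on the segments the SVLM scores as most query-relevant. This is a hypothetical illustration, not Tempo's actual implementation; the function name, the greedy rule, and the per-segment cap are all assumptions.

```python
# Hypothetical sketch of query-aware token allocation under a fixed budget.
# The greedy rule, names, and caps are assumptions, not Tempo's actual method.

def allocate_tokens(relevance, budget, min_anchor=1, max_per_seg=16):
    """Distribute `budget` visual tokens across segments by relevance.

    Every segment keeps at least `min_anchor` tokens (a temporal anchor
    preserving the global storyline); leftover budget goes to the most
    query-relevant segments, capped at `max_per_seg` tokens each.
    """
    n = len(relevance)
    alloc = [min_anchor] * n                  # minimal anchors first
    remaining = budget - min_anchor * n
    # Spend the leftover budget greedily on the most relevant segments.
    order = sorted(range(n), key=lambda i: relevance[i], reverse=True)
    for i in order:
        if remaining <= 0:
            break
        extra = min(max_per_seg - alloc[i], remaining)
        alloc[i] += extra
        remaining -= extra
    return alloc

# Example: 6 segments, 24-token budget; query-critical segments get
# dense bandwidth while the rest collapse to single-token anchors.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.1]
print(allocate_tokens(scores, budget=24))  # → [1, 16, 1, 4, 1, 1]
```

A single pass over segments sorted by relevance keeps the routing decision cheap, consistent with the paper's claim of a training-free, constant-overhead router.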

Abstract

Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
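The abstract's figures can be sanity-checked with back-of-the-envelope arithmetic: taking the "8K visual budget" as 8192 tokens (an assumption) spread over the 2048-frame setting, the average allocation is 4 tokens per frame, squarely inside the reported 0.5-16 tokens/frame dynamic range.

```python
# Back-of-the-envelope check of the abstract's budget figures.
# Assumes "8K visual budget" means 8192 tokens; 2048 frames as reported.
budget_tokens = 8192
frames = 2048
avg = budget_tokens / frames
print(avg)  # → 4.0 tokens/frame, within the 0.5-16 dynamic range
```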