AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

arXiv cs.CV / 4/10/2026


Key Points

  • AdaSpark proposes an adaptive sparsity framework to make Video-LLMs practical for long-form video by avoiding the high compute cost of dense processing.
  • The method partitions videos into 3D spatio-temporal cubes and uses co-designed, context-aware components (AdaS-Attn for cube selection and AdaS-FFN for token selection) to focus compute on what matters per query.
  • An entropy-based (Top-p) selection strategy dynamically allocates resources based on input complexity rather than relying on rigid sparse patterns.
  • Experiments report up to a 57% reduction in FLOPs while matching dense-model performance and preserving fine-grained, long-range temporal dependencies on hour-scale benchmarks.
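The Top-p idea in the last two bullets can be sketched generically: score the candidates (cubes or tokens), and keep the smallest set whose softmax mass reaches a threshold p, so peaked score distributions select few candidates and flat ones select many. The scoring function and threshold below are illustrative assumptions, not AdaSpark's actual implementation.

```python
import numpy as np

def top_p_select(scores, p=0.9):
    """Keep the smallest candidate set whose softmax mass reaches p.

    `scores` stands in for relevance scores (e.g., query-to-cube
    attention logits); this only illustrates generic Top-p (nucleus)
    selection, not the paper's exact scoring.
    """
    probs = np.exp(scores - scores.max())      # stable softmax
    probs /= probs.sum()
    order = np.argsort(-probs)                 # candidates by probability, descending
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, p)) + 1       # smallest prefix with mass >= p
    return order[:k]

# Peaked scores -> one cube suffices; flat scores -> all cubes are kept.
few = top_p_select(np.array([8.0, 1.0, 0.5, 0.2]), p=0.9)   # 1 candidate
many = top_p_select(np.array([1.0, 1.0, 1.0, 1.0]), p=0.9)  # 4 candidates
```

This is what makes the budget adaptive: the number of selected cubes/tokens is a function of the input's score entropy rather than a fixed sparsity ratio.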

Abstract

Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend to for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark reduces computational load by up to 57% FLOPs while maintaining performance comparable to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
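The cube partitioning the abstract describes amounts to reshaping the video tensor into non-overlapping spatio-temporal blocks. A minimal sketch, assuming a `(T, H, W, C)` frame layout and an illustrative cube size of `(4, 16, 16)` (the summary does not specify AdaSpark's actual dimensions):

```python
import numpy as np

def partition_cubes(video, cube=(4, 16, 16)):
    """Split a video of shape (T, H, W, C) into non-overlapping 3D cubes.

    `cube=(t, h, w)` is a hypothetical size chosen for illustration.
    Assumes T, H, and W are divisible by the cube dimensions.
    Returns an array of shape (num_cubes, t, h, w, C).
    """
    T, H, W, C = video.shape
    t, h, w = cube
    x = video.reshape(T // t, t, H // h, h, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # bring the three cube indices to the front
    return x.reshape(-1, t, h, w, C)

# An 8-frame 32x32 clip yields 2 temporal x 2x2 spatial = 8 cubes.
cubes = partition_cubes(np.zeros((8, 32, 32, 3)))
```

These cubes are the units over which AdaS-Attn would then select, with AdaS-FFN filtering tokens inside each selected cube.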