Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

arXiv cs.CV · April 15, 2026


Key Points

  • The paper addresses the high computational cost of self-attention in Video Diffusion Transformers and argues that existing sparse-attention approaches can cause severe temporal flickering.
  • It introduces Precision-Allocated Sparse Attention (PASA), a training-free framework that dynamically budgets compute based on curvature-aware profiling of acceleration across timesteps.
  • PASA improves efficiency by replacing global, homogenizing estimates with hardware-aligned grouped approximations, aiming to preserve local detail while maximizing throughput.
  • The method also adds a stochastic selection bias to the attention routing, softening rigid selection boundaries and preventing the selection oscillation that leads to localized compute starvation and flicker.
  • Experiments on leading video diffusion models report substantial inference acceleration alongside smoother, structurally stable video generation sequences.
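The curvature-aware budgeting idea in the points above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function name, the use of second differences of the latent trajectory as "acceleration", and the proportional allocation rule are all assumptions made for clarity.

```python
import numpy as np

def allocate_budget(latents, total_exact_steps):
    """Illustrative sketch of curvature-aware budgeting (hypothetical):
    give more of the exact-attention budget to timesteps where the
    denoising trajectory curves most."""
    # latents: (T, D) array, one summary latent vector per denoising timestep.
    velocity = np.diff(latents, axis=0)          # first difference ~ velocity
    acceleration = np.diff(velocity, axis=0)     # second difference ~ acceleration
    curvature = np.linalg.norm(acceleration, axis=1)  # magnitude per interior step
    weights = curvature / curvature.sum()        # normalize to a budget share
    # Allocate exact-computation steps in proportion to local curvature.
    return np.round(weights * total_exact_steps).astype(int)
```

Under this toy rule, timesteps in flat stretches of the trajectory receive almost no exact-computation budget, while steps around a sharp semantic transition absorb most of it.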

Abstract

Video Diffusion Transformers have revolutionized high-fidelity video generation but suffer from the massive computational burden of self-attention. While sparse attention provides a promising acceleration solution, existing methods frequently provoke severe visual flickering caused by static sparsity patterns and deterministic block routing. To resolve these limitations, we propose Precision-Allocated Sparse Attention (PASA), a training-free framework designed for highly efficient and temporally smooth video generation. First, we implement a curvature-aware dynamic budgeting mechanism. By profiling the generation trajectory acceleration across timesteps, we elastically allocate the exact-computation budget to secure high-precision processing strictly during critical semantic transitions. Second, we replace global homogenizing estimations with hardware-aligned grouped approximations, successfully capturing fine-grained local variations while maintaining peak compute throughput. Finally, we incorporate a stochastic selection bias into the attention routing mechanism. This probabilistic approach softens rigid selection boundaries and eliminates selection oscillation, effectively eradicating the localized computational starvation that drives temporal flickering. Extensive evaluations on leading video diffusion models demonstrate that PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences.
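The "hardware-aligned grouped approximation" contrasted with a global estimate can be illustrated with a minimal sketch. The paper does not spell out its estimator, so the function below is an assumption: it scores each tile-sized group of keys by its group mean, rather than collapsing all keys into one global mean that washes out local variation.

```python
import numpy as np

def block_scores(query, keys, group_size=64):
    """Hypothetical sketch: estimate each key block's importance from its
    per-group mean (a hardware-aligned tile), instead of a single global
    mean over all keys."""
    n, d = keys.shape
    assert n % group_size == 0, "keys assumed padded to the tile size"
    groups = keys.reshape(n // group_size, group_size, d)
    group_means = groups.mean(axis=1)   # (num_groups, d) local summaries
    return group_means @ query          # similarity score per key block
```

Choosing `group_size` to match the accelerator's tile size keeps the estimation reads contiguous and coalesced, which is the "hardware-aligned" part of the idea.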
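The stochastic selection bias in the routing mechanism can likewise be sketched. The exact noise model is not given in this summary, so the Gumbel perturbation and the `noise_scale` parameter below are illustrative assumptions: perturbing block scores before top-k selection makes near-threshold blocks win probabilistically instead of flipping deterministically from frame to frame, which is the oscillation the abstract blames for flicker.

```python
import numpy as np

def route_blocks(scores, k, noise_scale=0.1, rng=None):
    """Hypothetical sketch of stochastic top-k block routing: add small
    noise to block-importance scores before selection so that rigid
    selection boundaries are softened."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Gumbel perturbation turns a hard top-k into sampling without replacement.
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    perturbed = scores + noise_scale * gumbel
    return np.argsort(perturbed)[-k:]  # indices of the selected key/value blocks
```

Clearly dominant blocks are still selected essentially always, while two blocks with nearly tied scores each receive attention compute over time, avoiding the localized starvation described above.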