Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
arXiv cs.CV / 5/4/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- Diffusion Transformers for video generation suffer high latency because self-attention scales quadratically with sequence length, motivating sparse attention to speed up generation.
- The paper finds that prior sparse attention methods lose quality under the same computation budget for two reasons: critical tokens are selected by position rather than by semantics (inaccurate identification), and the selected tokens are scattered across the sequence, which wastes GPU computation on non-contiguous memory access.
- It introduces SVG2 (Sparse VideoGen2), a training-free framework that improves critical-token identification and reduces computation waste via a semantic-aware permutation: tokens are clustered by semantic similarity with k-means and reordered so that similar tokens sit contiguously (see the first sketch after this list).
- SVG2 further adds top-p dynamic budget control and customized kernel implementations, reporting up to 2.30× and 1.89× speedups on HunyuanVideo and Wan 2.1, respectively, while preserving video quality (PSNR up to 30 on HunyuanVideo and 26 on Wan 2.1); a sketch of top-p selection follows the list below.
- The authors open-source the codebase at the provided GitHub repository, enabling reproducibility and adoption by others working on efficient video generation.
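To make the semantic-aware permutation concrete, here is a minimal sketch of the idea in PyTorch: cluster key vectors with k-means, then sort tokens by cluster id so that semantically similar tokens become contiguous. All names, shapes, and the plain k-means loop are illustrative assumptions, not the SVG2 codebase.

```python
# Minimal sketch, assuming PyTorch. Illustrative only, not the SVG2 implementation.
import torch

def semantic_permutation(keys: torch.Tensor, num_clusters: int, iters: int = 10):
    """Cluster token key vectors with k-means, then return a permutation
    that places tokens of the same cluster contiguously.

    keys: (num_tokens, head_dim) key vectors for one attention head.
    Returns (perm, cluster_ids), where perm reorders the token axis.
    """
    num_tokens, _ = keys.shape
    # Initialize centroids from randomly chosen tokens.
    centroids = keys[torch.randperm(num_tokens)[:num_clusters]].clone()
    for _ in range(iters):
        # Assign each token to its nearest centroid (Euclidean distance).
        dists = torch.cdist(keys, centroids)      # (num_tokens, num_clusters)
        cluster_ids = dists.argmin(dim=1)         # (num_tokens,)
        # Recompute each centroid as the mean of its assigned tokens.
        for c in range(num_clusters):
            mask = cluster_ids == c
            if mask.any():
                centroids[c] = keys[mask].mean(dim=0)
    # Sorting by cluster id groups semantically similar tokens together,
    # so selected "critical" tokens form dense, GPU-friendly blocks
    # instead of scattered rows.
    perm = torch.argsort(cluster_ids)
    return perm, cluster_ids
```

In use, one would apply `perm` to the query/key/value tensors before the sparse attention kernel and undo it afterwards with the inverse permutation, `torch.argsort(perm)`.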
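The summary only names top-p dynamic budget control, so the mechanism below is an assumption: estimate each cluster's attention mass from query-centroid scores, then keep the smallest set of clusters whose cumulative mass reaches p, rather than a fixed token budget.

```python
# Minimal sketch of top-p cluster selection, assuming PyTorch.
# Names and the centroid-based score estimate are illustrative assumptions.
import torch

def top_p_cluster_selection(query: torch.Tensor,
                            centroids: torch.Tensor,
                            cluster_sizes: torch.Tensor,
                            p: float = 0.9) -> torch.Tensor:
    """Select the smallest set of key clusters whose estimated attention
    mass reaches p.

    query: (head_dim,) one query vector (or a pooled query block).
    centroids: (num_clusters, head_dim) k-means centroids of the keys.
    cluster_sizes: (num_clusters,) number of tokens in each cluster.
    Returns indices of the selected clusters.
    """
    # Approximate each cluster's attention score by the query-centroid
    # dot product, weighted by how many tokens the cluster holds.
    scale = query.shape[-1] ** -0.5
    scores = (centroids @ query) * scale               # (num_clusters,)
    weights = torch.softmax(scores, dim=0) * cluster_sizes
    weights = weights / weights.sum()
    # Greedily keep the highest-mass clusters until cumulative mass >= p.
    order = torch.argsort(weights, descending=True)
    cum = torch.cumsum(weights[order], dim=0)
    keep = int(torch.searchsorted(cum, torch.tensor(p)).item()) + 1
    return order[:keep]
```

The appeal of a top-p rule is that the budget adapts per query: flat attention distributions pull in more clusters, while peaked ones stop early, which is harder to achieve with a fixed top-k token count.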