Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
arXiv cs.CV · April 24, 2026
Key Points
- Sparse Forcing is proposed as a training-and-inference method for autoregressive diffusion video generation that boosts long-horizon quality while lowering decoding latency.
- The approach is based on the finding that attention during autoregressive diffusion rollouts repeatedly focuses on a persistent subset of salient visual blocks, which effectively acts like spatiotemporal memory in the KV cache.
- Sparse Forcing introduces a trainable native sparsity mechanism to compress, preserve, and update these persistent blocks, while limiting computation to a dynamically chosen local neighborhood within sliding windows.
- For GPU scalability, the paper also introduces Persistent Block-Sparse Attention (PBSA), an efficient kernel that accelerates both the sparse attention computation and KV-cache updates for low-latency decoding.
- Experiments on text-to-video generation at 5-second and longer horizons show improved VBench scores (including +0.26 over Self-Forcing on the 5-second setting), faster decoding (roughly 1.11–1.17×), and a 42% lower peak KV-cache memory footprint, with the gains growing at longer horizons.
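To make the attention pattern described above concrete, here is a minimal, hypothetical sketch of block-sparse causal attention in which each query block attends to a fixed set of "persistent" KV blocks plus a local sliding window of neighboring blocks. This is an illustrative simplification, not the paper's PBSA kernel: the function name, block layout, and parameters (`persistent_blocks`, `block_size`, `window_blocks`) are assumptions for exposition, and a real implementation would fuse this into a GPU kernel and learn which blocks persist.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, persistent_blocks, block_size=4, window_blocks=1):
    """Illustrative block-sparse causal attention (single head).

    Each query block attends to:
      - the 'persistent' KV blocks (a stand-in for the salient blocks the
        paper keeps in the KV cache), and
      - a local sliding window of the previous `window_blocks` blocks plus itself.
    """
    T, d = q.shape
    n_blocks = T // block_size
    out = np.zeros_like(q)
    for qb in range(n_blocks):
        # Allowed key blocks: persistent set (causal subset) + local window.
        allowed = {b for b in persistent_blocks if b <= qb}
        allowed.update(range(max(0, qb - window_blocks), qb + 1))
        idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                              for b in sorted(allowed)])
        q_pos = np.arange(qb * block_size, (qb + 1) * block_size)
        scores = q[q_pos] @ k[idx].T / np.sqrt(d)
        # Token-level causal mask inside the selected blocks.
        scores[q_pos[:, None] < idx[None, :]] = -np.inf
        out[q_pos] = softmax(scores) @ v[idx]
    return out
```

Note the compute saving: with a small persistent set and window, each query block touches only O(persistent + window) KV blocks instead of all preceding ones, which is where the decoding-latency and KV-cache reductions reported above would come from.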