Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

arXiv cs.CV / 4/24/2026


Key Points

  • Sparse Forcing is proposed as a training-and-inference method for autoregressive diffusion video generation that boosts long-horizon quality while lowering decoding latency.
  • The approach is based on the finding that attention during autoregressive diffusion rollouts repeatedly focuses on a persistent subset of salient visual blocks, which effectively acts like spatiotemporal memory in the KV cache.
  • Sparse Forcing introduces a trainable native sparsity mechanism to compress, preserve, and update these persistent blocks, while limiting computation to a dynamically chosen local neighborhood within sliding windows.
  • For scalability on GPUs, it also presents Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and KV-cache updates for low-latency decoding.
  • Experiments on 5-second and longer text-to-video generation show improved VBench scores (including +0.26 over Self-Forcing on 5-second generation), faster decoding (roughly 1.11–1.17x), and a 42% lower peak KV-cache memory footprint, with larger benefits at longer horizons.

Abstract

We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and memory updates for low-latency, memory-efficient decoding. Experiments show that Sparse Forcing improves the VBench score by +0.26 over Self-Forcing on 5-second text-to-video generation while delivering a 1.11-1.17x decoding speedup and 42% lower peak KV-cache footprint. The gains are more pronounced on longer-horizon rollouts, delivering improved visual quality with +0.68 and +2.74 VBench improvements, and 1.22x and 1.27x speedups on 20-second and 1-minute generations, respectively.
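To make the core idea concrete, the following is a minimal NumPy sketch of block-sparse attention in which each query block attends only to a small set of persistently salient KV blocks plus a local sliding window. This is an illustrative toy, not the paper's implementation: the block-saliency heuristic (mean key norm), the function names, and all parameters are assumptions; Sparse Forcing instead learns the selection natively during training and executes it with the PBSA GPU kernel.

```python
import numpy as np

def select_blocks(block_scores, query_block, num_persistent, window):
    """Pick KV block indices: top-scoring 'persistent' blocks plus a local window.

    block_scores covers only the causally visible blocks (indices 0..query_block).
    """
    persistent = set(np.argsort(block_scores)[::-1][:num_persistent])
    local = set(range(max(0, query_block - window), query_block + 1))
    return sorted(persistent | local)

def block_sparse_attention(q, k, v, block_size, num_persistent, window):
    """Toy block-sparse causal attention over (seq_len, dim) arrays.

    Each query block gathers keys/values only from its selected blocks,
    mimicking a compressed KV cache of persistent blocks + sliding window.
    Causality is enforced at block granularity only (toy simplification).
    """
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    out = np.zeros_like(q)
    # Stand-in saliency score per block (the real method learns/updates this).
    key_norms = np.linalg.norm(k, axis=1)
    block_scores = key_norms.reshape(n_blocks, block_size).mean(axis=1)
    for qb in range(n_blocks):
        kept = select_blocks(block_scores[: qb + 1], qb, num_persistent, window)
        idx = np.concatenate(
            [np.arange(b * block_size, (b + 1) * block_size) for b in kept]
        )
        qs = q[qb * block_size : (qb + 1) * block_size]
        scores = qs @ k[idx].T / np.sqrt(dim)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[qb * block_size : (qb + 1) * block_size] = weights @ v[idx]
    return out
```

Because each query block touches at most `num_persistent + window + 1` KV blocks instead of all preceding ones, both attention FLOPs and the live KV footprint stay roughly constant as the rollout grows, which is the source of the reported latency and memory gains.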