DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

arXiv cs.CV / 4/23/2026


Key Points

  • DynamicRad introduces a content-adaptive sparse-attention method for long video diffusion, using a radial locality prior to avoid losing critical long-range information from static masks.
  • The approach supports a dual-mode execution strategy—static-ratio for faster inference and dynamic-threshold for quality-first filtering.
  • To prevent runtime overhead from online search, DynamicRad uses an offline Bayesian Optimization pipeline and a semantic motion router that maps prompt embeddings to appropriate sparsity regimes with minimal extra cost.
  • Experiments on HunyuanVideo and Wan2.1-14B show 1.7×–2.5× inference speedups with over 80% effective sparsity, and in some long-sequence cases the dynamic mode matches or beats dense baselines.
  • Mask-aware LoRA is also reported to further improve long-horizon coherence, and the authors provide code on GitHub.
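The semantic motion router mentioned above is described as a lightweight projection from prompt embeddings to sparsity regimes. A minimal sketch of that idea follows; the embedding dimension, number of regimes, keep-ratios, and the name `route` are all illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical router: one linear projection mapping a pooled prompt
# embedding to logits over discrete sparsity regimes (e.g. low / medium /
# high motion). Dimensions and regime ratios below are assumed values.
EMB_DIM, N_REGIMES = 768, 3
W = rng.standard_normal((EMB_DIM, N_REGIMES)) * 0.02
b = np.zeros(N_REGIMES)
REGIME_RATIOS = [0.10, 0.20, 0.35]  # fraction of attention entries kept

def route(prompt_embedding: np.ndarray) -> float:
    """Return the keep-ratio of the regime with the highest logit."""
    logits = prompt_embedding @ W + b
    return REGIME_RATIOS[int(np.argmax(logits))]

ratio = route(rng.standard_normal(EMB_DIM))
```

Because routing is a single matrix multiply over an embedding the model already computes, it adds negligible cost per prompt, which is consistent with the "minimal extra cost" claim.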

Abstract

Leveraging the natural spatiotemporal energy decay in video diffusion offers a path to efficiency, yet relying solely on rigid static masks risks losing critical long-range information in complex dynamics. To address this issue, we propose DynamicRad, a unified sparse-attention paradigm that grounds adaptive selection within a radial locality prior. DynamicRad introduces a dual-mode strategy: static-ratio for speed-optimized execution and dynamic-threshold for quality-first filtering. To ensure robustness without online search overhead, we integrate an offline Bayesian Optimization (BO) pipeline coupled with a semantic motion router. This lightweight projection module maps prompt embeddings to optimal sparsity regimes with minimal runtime overhead. Unlike online profiling methods, our offline BO optimizes attention reconstruction error (MSE) on a physics-based proxy task, ensuring rapid convergence. Experiments on HunyuanVideo and Wan2.1-14B demonstrate that DynamicRad pushes the efficiency–quality Pareto frontier, achieving 1.7×–2.5× inference speedups with over 80% effective sparsity. In some long-sequence settings, the dynamic mode even matches or exceeds the dense baseline, while mask-aware LoRA further improves long-horizon coherence. Code is available at https://github.com/Adamlong3/DynamicRad.
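The dual-mode selection over a radial locality prior can be illustrated with a toy sketch: a prior that decays with spatiotemporal token distance, masked either by a fixed keep-ratio (static-ratio mode) or by a score cutoff (dynamic-threshold mode). The decay function, token count, and thresholds here are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def radial_prior(n_tokens: int, decay: float = 0.05) -> np.ndarray:
    """Toy radial locality prior: attention energy decays exponentially
    with the (1-D) spatiotemporal distance |i - j| between tokens."""
    idx = np.arange(n_tokens)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.exp(-decay * dist)

def select_mask(scores: np.ndarray, mode: str = "static",
                keep_ratio: float = 0.2, threshold: float = 0.5) -> np.ndarray:
    """Dual-mode sparse mask selection.
    - "static": keep a fixed ratio of top-scoring entries per query
      (speed-first, predictable cost)
    - "dynamic": keep every entry above a score threshold
      (quality-first, sparsity varies with content)
    """
    if mode == "static":
        k = max(1, int(keep_ratio * scores.shape[-1]))
        kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]
        return scores >= kth
    if mode == "dynamic":
        return scores >= threshold
    raise ValueError(f"unknown mode: {mode}")

prior = radial_prior(64)
static_mask = select_mask(prior, "static", keep_ratio=0.2)
dynamic_mask = select_mask(prior, "dynamic", threshold=0.2)
sparsity = 1.0 - static_mask.mean()
```

Static-ratio mode yields a fixed compute budget regardless of content, while dynamic-threshold mode lets high-motion sequences retain more long-range entries, which is the trade-off the paper's dual-mode design targets.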