Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

arXiv cs.CV / 4/14/2026


Key Points

  • The paper introduces “Hybrid Forcing,” a hybrid attention architecture aimed at improving long-horizon streaming video generation by better retaining distant temporal history than sliding-window attention alone.
  • It combines lightweight linear temporal attention, which maintains a compact key-value state to absorb and retain evicted tokens, with block-sparse local attention that cuts redundant short-range computation.
  • The authors propose a decoupled distillation strategy: an initial few-step distillation under dense attention, after which distillation of the linear-attention and block-sparse components is activated for stable training.
  • Experiments across short- and long-form video generation benchmarks report state-of-the-art performance, including real-time, unbounded 832×480 generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or compression.
  • Code and trained models are provided via the linked GitHub repository, enabling replication and further development of the method.
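The compact key-value state described above can be illustrated with a minimal numpy sketch. It assumes a standard linear-attention formulation (feature map φ(x) = elu(x) + 1, outer-product state updates); the function names and the exact feature map are illustrative, not taken from the paper:

```python
import numpy as np

def evict_into_state(S, z, k, v):
    """Absorb a token evicted from the sliding window into the compact
    linear-attention state. S: (d_k, d_v) running sum of phi(k)^T v;
    z: (d_k,) running sum of phi(k)."""
    # Feature map phi(x) = elu(x) + 1 keeps keys positive; this is a
    # common choice for linear attention, assumed here for illustration.
    phi_k = np.where(k > 0, k + 1.0, np.exp(k))
    S += np.outer(phi_k, v)
    z += phi_k
    return S, z

def read_long_range(S, z, q, eps=1e-6):
    """Query the compact state: O(d_k * d_v) per token, independent of
    how many tokens have been evicted from the window."""
    phi_q = np.where(q > 0, q + 1.0, np.exp(q))
    return (phi_q @ S) / (phi_q @ z + eps)

# Usage: stream 100 tokens out of the local window; memory stays fixed.
d_k, d_v = 8, 8
S, z = np.zeros((d_k, d_v)), np.zeros(d_k)
rng = np.random.default_rng(0)
for _ in range(100):
    S, z = evict_into_state(S, z, rng.normal(size=d_k), rng.normal(size=d_v))
out = read_long_range(S, z, rng.normal(size=d_k))
```

The key property is that `S` and `z` have fixed size regardless of how much history has been evicted, which is what makes the long-range pathway's memory and compute overhead negligible.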

Abstract

Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is performed under dense attention, then the distillation of our proposed linear temporal and block-sparse attention is activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832×480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression. The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing.
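The block-sparse component inside the local window can be sketched as follows. This is a hedged illustration of one common block-selection heuristic (mean-pooled block affinities with top-k retention); the paper's actual selection rule and block sizes are not specified here, and all names are illustrative:

```python
import numpy as np

def block_sparse_mask(q, k, block_size, keep_blocks):
    """Decide which key blocks inside the local window each query block
    attends to. Blocks are ranked by mean-pooled query/key similarity,
    and only the top `keep_blocks` (plus the query's own block) are kept,
    so attention cost scales with keep_blocks rather than window length."""
    T, d = q.shape
    nb = T // block_size  # assumes T is a multiple of block_size
    qb = q.reshape(nb, block_size, d).mean(axis=1)  # pooled query blocks
    kb = k.reshape(nb, block_size, d).mean(axis=1)  # pooled key blocks
    scores = qb @ kb.T                              # (nb, nb) block affinities
    mask = np.zeros((nb, nb), dtype=bool)
    top = np.argsort(-scores, axis=1)[:, :keep_blocks]
    mask[np.arange(nb)[:, None], top] = True
    np.fill_diagonal(mask, True)  # always keep the query's own block
    return mask

# Usage: a 16-token window split into 4 blocks, keeping 2 blocks per query.
rng = np.random.default_rng(1)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))
mask = block_sparse_mask(q, k, block_size=4, keep_blocks=2)
```

Computation skipped by the mask within the short-range window is what the abstract describes as being "reallocated" toward more critical dependencies, such as the long-range linear-attention pathway.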