Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion

arXiv cs.CV / 3/17/2026

Key Points

  • The paper introduces Anchor Forcing, a cache-centric framework that addresses two failure modes of interactive streaming video diffusion: weak boundary conditioning at prompt switches, and weakened motion priors caused by unbounded time indexing.
  • It proposes an anchor-guided re-cache mechanism that stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch to reduce post-switch evidence loss and stabilize perceptual quality.
  • It also presents a tri-region RoPE with region-specific reference origins and RoPE re-alignment distillation to reconcile unbounded streaming indices with the pretrained RoPE regime and better retain long-horizon motion priors.
  • Experiments on long videos show improved perceptual quality and motion metrics over prior streaming baselines in interactive settings. A project page is available on GitHub.
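
The anchor-guided re-cache idea in the second point can be illustrated with a toy data structure. This is a minimal sketch, not the paper's implementation: the class `ToyStreamingCache`, its methods, the `keep_recent` parameter, and the string stand-ins for KV tensors are all hypothetical, and the summary does not say how anchors are chosen or how many recent frames are re-encoded.

```python
from collections import deque


class ToyStreamingCache:
    """Toy rolling KV cache with a separate anchor store (hypothetical names)."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry once the window is full.
        self.rolling = deque(maxlen=window)
        self.anchors = []

    def append(self, frame_id, kv, is_anchor=False):
        """Add one frame's KV state; optionally keep a copy as an anchor."""
        self.rolling.append((frame_id, kv))
        if is_anchor:
            self.anchors.append((frame_id, kv))

    def recache_on_switch(self, recompute, keep_recent):
        """Rebuild the rolling cache at a prompt switch.

        Warm-start from the stored anchors, then re-encode the most recent
        frames under the new prompt with the caller-supplied `recompute`.
        """
        # Guard keep_recent == 0: a [-0:] slice would return the whole list.
        recent = list(self.rolling)[-keep_recent:] if keep_recent > 0 else []
        self.rolling.clear()
        for frame_id, kv in self.anchors:
            self.rolling.append((frame_id, kv))
        for frame_id, kv in recent:
            self.rolling.append((frame_id, recompute(frame_id, kv)))


cache = ToyStreamingCache(window=4)
for t in range(6):
    cache.append(t, f"kv{t}@promptA", is_anchor=(t == 0))

# Frame 0 has already rolled out of the window but survives as an anchor.
cache.recache_on_switch(lambda fid, kv: f"kv{fid}@promptB", keep_recent=2)
print([fid for fid, _ in cache.rolling])  # [0, 4, 5]
```

In this toy version the anchor keeps its original KV state while the two most recent frames are re-encoded under the new prompt, so the rebuilt cache holds both older context and recent cues. Whether the actual method treats anchors this way is an assumption.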

Abstract

Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose Anchor Forcing, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings. Project page: https://github.com/vivoCameraResearch/Anchor-Forcing
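
To make the positional issue in (ii) concrete, the sketch below re-bases an unbounded stream index against a per-region origin so that the position fed to RoPE stays inside a bounded range. It is an illustration under stated assumptions, not the paper's formulation: the abstract does not define the three regions, so the names `anchor`, `history`, and `current`, the constants `MAX_POS` and `REGION_OFFSET`, and the function `bounded_position` are hypothetical. Only `rope_rotate` follows the standard RoPE definition (rotating feature pair i by pos * base^(-2i/dim)).

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply standard RoPE to `vec` (even length) at integer position `pos`."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        ang = pos * base ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(ang) - y * math.sin(ang))
        out.append(x * math.sin(ang) + y * math.cos(ang))
    return out


# Hypothetical three-region layout inside the bounded range [0, MAX_POS).
MAX_POS = 32
REGION_OFFSET = {"anchor": 0, "history": 4, "current": 24}


def bounded_position(global_idx, region, region_origin):
    """Map an unbounded stream index to a bounded RoPE position.

    `region_origin[region]` is the global index of the first token that
    currently belongs to the region; subtracting it re-bases the index.
    """
    pos = REGION_OFFSET[region] + (global_idx - region_origin[region])
    if not 0 <= pos < MAX_POS:
        raise ValueError(f"position {pos} outside the bounded range")
    return pos


# Early in the stream: the current chunk starts at global index 40.
early = bounded_position(42, "current",
                         {"anchor": 0, "history": 20, "current": 40})
# Much later: the current chunk starts at global index 100000.
late = bounded_position(100002, "current",
                        {"anchor": 0, "history": 99980, "current": 100000})
print(early, late)  # 26 26

q_rot = rope_rotate([1.0, 0.0, 0.5, -0.5], late)
```

Because the token in the current chunk gets the same position whether the stream is 42 or 100002 frames in, the rotation angles never leave the range covered by `MAX_POS`. The RoPE re-alignment distillation step mentioned in the abstract is not modeled here.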