Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

arXiv cs.CV / 4/9/2026


Key Points

  • The paper argues that autoregressive video synthesis struggles with long-horizon semantic forgetting, visual drift from positional extrapolation, and loss of controllability when switching instructions during generation.
  • It proposes “Grounded Forcing,” which combines three mechanisms—Dual Memory KV Cache, Dual-Reference RoPE Injection, and Asymmetric Proximity Recache—to jointly preserve global semantics, limit positional drift, and maintain controllability across prompt transitions.
  • Dual Memory KV Cache decouples local temporal dynamics from global semantic anchors to improve identity stability and reduce semantic degradation over long sequences.
  • Dual-Reference RoPE Injection aims to keep positional embeddings within the training manifold while making global semantics time-invariant, reducing visual drift.
  • Experiments reportedly show improved long-range consistency and visual stability for interactive long-form video synthesis, suggesting a more robust foundation for infinite-horizon generation.
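The dual-memory idea above can be illustrated with a minimal sketch: a set of pinned "anchor" KV entries that are never evicted, plus a bounded sliding window for recent frames. All class and method names here are illustrative assumptions, not the paper's actual implementation.

```python
from collections import deque

class DualMemoryKVCache:
    """Hypothetical sketch of decoupling global semantic anchors from
    local temporal dynamics. Not the paper's API; names are assumed."""

    def __init__(self, num_anchors, window_size):
        self.anchors = []                        # global anchors, never evicted
        self.window = deque(maxlen=window_size)  # local dynamics, FIFO eviction

    def append(self, kv_entry):
        # Assumption for illustration: the earliest frames serve as
        # semantic anchors; later frames stream through a bounded window.
        if len(self.anchors) < 2:
            self.anchors.append(kv_entry)
        else:
            self.window.append(kv_entry)

    def context(self):
        # Attention attends to anchors + recent window, so cost stays
        # bounded while the global semantic reference is never dropped.
        return self.anchors + list(self.window)

cache = DualMemoryKVCache(num_anchors=2, window_size=3)
for t in range(8):
    cache.append(f"kv_frame_{t}")
print(cache.context())
# ['kv_frame_0', 'kv_frame_1', 'kv_frame_5', 'kv_frame_6', 'kv_frame_7']
```

The point of the sketch is the eviction policy: a plain sliding-window cache would eventually drop `kv_frame_0`, losing the semantic anchor; here the anchors survive indefinitely while memory stays constant.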

Abstract

Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.
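One way to picture the proximity-weighted cache update described for Asymmetric Proximity Recache is a blend between old and new cache states whose mixing weight decays with temporal distance from the prompt-switch boundary. The exponential weighting, the `tau` parameter, and the function name below are all illustrative assumptions, not details from the paper.

```python
import math

def proximity_recache(old_cache, new_entries, tau=2.0):
    """Hypothetical sketch: at a prompt switch, cache entries closest to
    the switch boundary (end of the cache) inherit more of the old
    semantics, while distant entries adopt the new prompt's state,
    giving a smooth rather than abrupt transition."""
    merged = []
    n = len(old_cache)
    for i, (old, new) in enumerate(zip(old_cache, new_entries)):
        # Weight decays exponentially with distance from the boundary
        # (assumed form; the paper's exact weighting is not specified here).
        w = math.exp(-(n - 1 - i) / tau)
        merged.append(w * old + (1.0 - w) * new)
    return merged
```

For example, with scalar stand-ins `old_cache = [1.0] * 4` and `new_entries = [0.0] * 4`, the merged values rise monotonically toward 1.0 at the boundary, i.e. the entry adjacent to the switch fully inherits the old semantics while earlier entries lean toward the new prompt.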