Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

arXiv cs.CV / 3/27/2026


Key Points

  • The paper tackles long-video generation with pre-trained video diffusion models by identifying two main sources of quality degradation: frame-level relative position out-of-distribution (O.O.D) and context-length O.O.D.
  • It proposes FreeLOC, a training-free, layer-adaptive framework that applies Video-based Relative Position Re-encoding (VRPR) to re-align temporal relative positions with the model’s pre-trained distribution.
  • For context-length O.O.D, it introduces Tiered Sparse Attention (TSA), which preserves local detail while maintaining long-range temporal dependencies through multi-scale attention structuring.
  • A layer-adaptive probing mechanism estimates which transformer layers are most sensitive to each O.O.D issue, enabling selective and efficient application of the corrections.
  • Experiments report state-of-the-art results, outperforming existing training-free methods in both temporal consistency and visual quality, with accompanying code released on GitHub.
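The summary does not give VRPR's exact formulation, but the core idea of re-aligning temporal relative positions with the pre-trained range can be illustrated with plain position interpolation feeding rotary embeddings. The function names, the uniform rescaling, and the RoPE base are all assumptions for this sketch, not the paper's method:

```python
import numpy as np

def reencode_positions(num_frames: int, trained_len: int) -> np.ndarray:
    """Illustrative stand-in for VRPR: compress the temporal axis of a
    long video so every relative distance falls inside the range the
    model saw during pre-training (simple uniform interpolation; the
    paper's multi-granularity hierarchy would refine this)."""
    pos = np.arange(num_frames, dtype=np.float64)
    if num_frames <= trained_len:
        return pos  # already in-distribution, leave untouched
    return pos * (trained_len - 1) / (num_frames - 1)

def rope_angles(pos: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotary-embedding phase angles for the re-encoded positions."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, inv_freq)  # shape (len(pos), dim // 2)
```

With 64 frames and a trained length of 16, the largest re-encoded position is 15, so no query-key pair sees a relative distance the model never encountered during training.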

Abstract

Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework built on two core techniques. Video-based Relative Position Re-encoding (VRPR) targets frame-level relative position O.O.D with a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution. Tiered Sparse Attention (TSA) targets context-length O.O.D, preserving both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.
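"Structuring attention density across different temporal scales" can be pictured as a tiered attention mask: a dense local tier plus a strided global tier. The two-tier split, window size, and stride below are illustrative assumptions, not the paper's exact TSA design:

```python
import numpy as np

def tiered_sparse_mask(n_frames: int, local_window: int = 8,
                       stride: int = 4) -> np.ndarray:
    """Illustrative tiered attention mask: each query frame attends
    densely to a local window (preserving detail) plus a strided
    subset of distant frames (preserving long-range consistency)."""
    q = np.arange(n_frames)[:, None]
    k = np.arange(n_frames)[None, :]
    local = np.abs(q - k) < local_window   # dense local tier
    sparse = (k % stride == 0)             # strided long-range tier
    return local | sparse                  # boolean (n_frames, n_frames)
```

Compared with full attention over all frame pairs, such a mask keeps attention cost roughly linear in the number of frames while every query still reaches the entire temporal extent through the strided tier.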
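The layer-adaptive probing mechanism is described only at a high level; one plausible reading, sketched below with hypothetical function names and a made-up drift metric, is to run the model on an in-distribution (short) and an out-of-distribution (long) input, score each layer by how much its output drifts, and apply the corrections only to the most sensitive layers:

```python
import numpy as np

def probe_layer_sensitivity(short_outs, long_outs):
    """Hypothetical probing: score each transformer layer by the
    relative L2 drift of its output between a short (in-distribution)
    and a long (O.O.D) input."""
    scores = []
    for s, l in zip(short_outs, long_outs):
        scores.append(np.linalg.norm(l - s) / (np.linalg.norm(s) + 1e-8))
    return np.array(scores)

def select_layers(scores: np.ndarray, top_k: int = 4) -> np.ndarray:
    """Pick the top-k most O.O.D-sensitive layers for correction."""
    return np.argsort(scores)[::-1][:top_k]
```

Restricting VRPR and TSA to the selected layers is what makes the framework "selective and efficient": layers that already generalize to long inputs are left untouched.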