Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

arXiv cs.CV / 4/6/2026


Key Points

  • The paper introduces SCOPE, a training-free framework to speed up autoregressive (AR) video diffusion by reducing wasteful denoising work across frames.
  • It uses a tri-modal scheduler—cache, predict, and recompute—so the method can handle intermediate situations where binary reuse/recompute decisions are too coarse.
  • Prediction is performed with noise-level Taylor extrapolation, and the approach includes stability controls supported by error propagation analysis.
  • SCOPE also applies selective computation by restricting execution to the active frame interval, avoiding uniform processing over the entire valid range.
  • Experiments on MAGI-1 and SkyReels-V2 show up to 4.73× speedups with output quality comparable to the original, outperforming prior training-free baselines.
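The tri-modal idea in the points above can be sketched as a per-frame decision rule: reuse the cached features when they are expected to barely change, extrapolate them to the new noise level when the change is moderate, and fall back to full recomputation otherwise. The first-order Taylor form, the drift estimate, and both thresholds below are illustrative assumptions, not SCOPE's exact scheduler or stability controls.

```python
import numpy as np

def tri_modal_step(f_prev, f_prev2, t_prev, t_prev2, t_new,
                   cache_tol=1e-3, predict_tol=1e-1):
    """Illustrative cache/predict/recompute decision for one frame's features.

    f_prev, f_prev2: features cached at the two most recent noise levels.
    t_prev, t_prev2, t_new: the corresponding noise levels (hypothetical
    parameterization; SCOPE's actual schedule may differ).
    """
    # Finite-difference estimate of how features change with noise level.
    slope = (f_prev - f_prev2) / (t_prev - t_prev2)
    # Expected feature drift if we step from t_prev to t_new.
    change = np.abs(slope).max() * abs(t_new - t_prev)
    if change < cache_tol:
        return "cache", f_prev                      # direct reuse is safe
    if change < predict_tol:
        # First-order Taylor extrapolation in noise level.
        return "predict", f_prev + slope * (t_new - t_prev)
    return "recompute", None                        # full denoising pass needed
```

A stability control of the kind the paper mentions would cap how many consecutive predict steps are allowed before forcing a recompute, so extrapolation error cannot accumulate unchecked.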

Abstract

Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.
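The selective-computation idea in the abstract can be illustrated with a small helper that finds the frames still needing work under an asynchronous schedule, where co-generated frames sit at different noise levels. The convention below (noise level 0 means finished, 1 means not yet entered) is an assumption for illustration, not SCOPE's actual schedule.

```python
def active_interval(noise_levels, eps=1e-6):
    """Return [start, end) covering frames that still need denoising.

    Hypothetical convention: frames at ~0.0 noise are fully denoised,
    frames at ~1.0 have not entered the schedule yet; only the interval
    in between should be passed through the model.
    """
    start = 0
    while start < len(noise_levels) and noise_levels[start] <= eps:
        start += 1  # skip the fully denoised prefix
    end = len(noise_levels)
    while end > start and noise_levels[end - 1] >= 1.0 - eps:
        end -= 1    # skip the not-yet-started suffix
    return start, end
```

Restricting each denoising call to this interval avoids the uniform processing of the entire valid range that the abstract identifies as wasteful.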