Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

arXiv cs.CV / 4/6/2026


Key Points

  • The paper introduces Salt (SC-DMD), a distillation method aimed at improving real-time video generation under extremely low inference budgets (about 2–4 network function evaluations, or NFEs).
  • It addresses shortcomings of prior consistency distillation by explicitly regularizing how consecutive denoising updates compose so that rollouts remain endpoint-consistent instead of drifting or over-smoothing.
  • Salt further enhances autoregressive low-NFE generation by treating the KV cache as a conditioning quality signal and using Cache-Distribution-Aware training with a cache-conditioned feature alignment objective.
  • Experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing) reportedly yield consistently better low-NFE output quality while staying compatible with different KV-cache memory mechanisms.
  • The authors state that code will be released, indicating the method is intended to be reproducible and usable for further research and deployment-oriented experimentation.
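The cache-conditioned feature alignment idea can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes (hypothetically) that the student yields intermediate features once under a degraded/low-quality KV cache and once under a high-quality reference cache, and that the degraded-path features are pulled toward the reference features, which are treated as fixed targets (a stop-gradient in a real training framework):

```python
import numpy as np

def feature_alignment_loss(student_feats: np.ndarray, reference_feats: np.ndarray) -> float:
    """Mean-squared alignment between features computed under a low-quality
    KV cache (student_feats) and a high-quality reference cache (reference_feats).

    The reference is treated as a constant target; in an autodiff framework
    it would be detached so gradients flow only through the student path.
    """
    target = reference_feats  # conceptually: reference_feats.detach()
    diff = student_feats - target
    return float(np.mean(diff ** 2))

# Toy usage: identical features give zero loss; a mismatch is penalized.
aligned = feature_alignment_loss(np.ones(4), np.ones(4))      # 0.0
mismatch = feature_alignment_loss(np.zeros(4), np.ones(4))    # 1.0
```

The detach on the reference path is the key design choice: it steers the low-quality-cache outputs toward the high-quality ones rather than letting both drift toward each other.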

Abstract

Distilling video generation models to extremely low inference budgets (e.g., 2–4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed **Salt**, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.
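The endpoint-consistency idea behind SC-DMD can be sketched concretely. The following is an illustrative toy, not the paper's objective: it assumes a student denoiser `denoise(x, t)` that predicts the clean endpoint `x0` from a noisy sample at time `t`, a simple linear interpolation path between noise and data, and penalizes the discrepancy between predicting the endpoint directly from time `t` versus first stepping to an intermediate time `s` and predicting from there:

```python
import numpy as np

def self_consistency_loss(x_t: np.ndarray, t: float, s: float, denoise) -> float:
    """Penalize disagreement between a direct endpoint prediction at time t
    and the endpoint prediction after composing one intermediate update to
    time s (0 < s < t). A perfectly self-consistent denoiser gives 0.
    """
    # Direct jump: predict the clean endpoint x0 from (x_t, t).
    x0_direct = denoise(x_t, t)
    # Composed rollout: move along a linear path toward the predicted
    # endpoint to reach time s, then predict the endpoint again from there.
    x_s = (s / t) * x_t + (1.0 - s / t) * x0_direct
    x0_composed = denoise(x_s, s)
    # Endpoint-consistency penalty: the two predictions should agree.
    return float(np.mean((x0_composed - x0_direct) ** 2))

# A denoiser that always returns the same endpoint is trivially
# self-consistent, so the penalty vanishes.
consistent = lambda x, t: np.zeros_like(x)
zero_loss = self_consistency_loss(np.ones(4), t=0.8, s=0.4, denoise=consistent)  # 0.0
```

In actual training this regularizer would be combined with the DMD distribution-matching loss, so rollouts stay sharp (from DMD) while consecutive updates compose without drifting (from the consistency term).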