A Systematic Post-Train Framework for Video Generation

arXiv cs.CV / 4/29/2026


Key Points

  • The paper identifies a deployment gap for large-scale video diffusion models, citing prompt sensitivity, temporal inconsistency, and high inference costs.
  • It proposes a four-stage post-training framework, beginning with supervised fine-tuning (SFT) for stable instruction following, followed by an RLHF stage that uses a video-tailored Group Relative Policy Optimization (GRPO) method to improve perceptual quality and temporal coherence (a minimal GRPO-style update is sketched after this list).
  • It adds a prompt-enhancement step using a dedicated language model to better align user inputs with desired outputs.
  • It includes inference optimization to reduce cost while maintaining controllability learned during pretraining.
  • Experiments report fewer common generation artifacts and significant gains in controllability and visual aesthetics under strict sampling-cost constraints.
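
The following is a minimal, illustrative sketch of what a GRPO-style RLHF update for a video diffusion policy could look like. It is not the paper's implementation: `policy.sample`, `policy.log_prob`, and `reward_model` are hypothetical interfaces standing in for the model's denoising-trajectory sampler, its trajectory log-likelihood, and a learned video reward model. The pieces shown are the ones GRPO is known for: group-relative advantage normalization, a clipped importance-ratio objective, and a KL penalty toward a frozen reference policy (e.g., the SFT model).

```python
import torch


def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each sample's reward by its group's statistics.

    rewards: (num_prompts, group_size) scores from a video reward model.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std


def grpo_step(policy, ref_policy, optimizer, prompts, reward_model,
              group_size: int = 4, clip_eps: float = 0.2, kl_coef: float = 0.01):
    """One illustrative GRPO-style update (hypothetical interfaces, not the paper's code)."""
    with torch.no_grad():
        # Sample a group of videos per prompt and keep the denoising trajectories.
        videos, old_logp, traj = policy.sample(prompts, group_size=group_size)
        rewards = reward_model(videos, prompts)              # (P, G)
        adv = grpo_advantages(rewards).flatten()             # (P*G,)

    new_logp = policy.log_prob(traj, prompts)                # trajectory log-likelihood under current params
    ref_logp = ref_policy.log_prob(traj, prompts)            # frozen reference policy

    ratio = (new_logp - old_logp.flatten()).exp()
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    kl_penalty = kl_coef * (new_logp - ref_logp).mean()      # stay close to the reference policy

    loss = policy_loss + kl_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```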

Abstract

While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
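
To make the deployment-facing stages concrete, the sketch below shows how the prompt-enhancement and inference-optimization ideas might fit together at serving time. The prompt template, `llm.generate`, and `video_model.sample` are assumptions for illustration, not the paper's actual interfaces; the only points carried over from the abstract are that a dedicated language model rewrites the user prompt and that sampling cost is constrained (here, via a reduced step count).

```python
def enhance_prompt(user_prompt: str, llm) -> str:
    """Rewrite a terse user request into the detailed prompt style the video model was tuned on.

    `llm` is any instruction-following language model; this template is a placeholder.
    """
    instruction = (
        "Rewrite the following video request as a detailed prompt describing "
        "subject, motion, camera, lighting, and style, without inventing new objects:\n"
        f"{user_prompt}"
    )
    return llm.generate(instruction)


def generate_video(user_prompt: str, llm, video_model, num_steps: int = 16):
    """Deployment-time path: enhanced prompt in, few-step sampling out.

    Reducing the number of denoising steps is one common way to meet a sampling-cost
    budget; the paper's specific inference optimizations may differ.
    """
    prompt = enhance_prompt(user_prompt, llm)
    return video_model.sample(prompt, num_inference_steps=num_steps)
```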