AI Navigate

SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning

arXiv cs.CV / 3/19/2026

Key Points

  • The paper addresses the weakening of motion fidelity and long-term temporal coherence that video diffusion models exhibit after fine-tuning.
  • It introduces pixel-motion rewards based on pixel flux dynamics to capture both instantaneous and long-term motion consistency (see the sketch after this list).
  • It proposes Smooth Hybrid Fine-tuning (SHIFT), which unifies supervised fine-tuning with advantage-weighted fine-tuning in a reward-driven framework and leverages adversarial advantages to improve convergence and reduce reward hacking.
  • Experiments show SHIFT effectively resolves dynamic-degree collapse in modern video diffusion models during supervised fine-tuning.
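The summary does not give the reward formula, but the idea of a pixel-flux motion reward can be sketched from frame differences. The snippet below is a minimal illustration, assuming adjacent-frame pixel change as a proxy for pixel flux; the function name and the way instantaneous dynamics and long-term consistency are combined are illustrative choices, not the paper's definition.

```python
import torch

def pixel_flux_reward(frames: torch.Tensor) -> torch.Tensor:
    """Toy pixel-flux motion reward for a video clip.

    frames: tensor of shape (T, C, H, W) with values in [0, 1].
    Rewards clips with non-trivial motion (mean adjacent-frame flux)
    while penalizing erratic flux over time, a crude proxy for
    long-term temporal coherence.
    """
    # Instantaneous pixel flux: mean absolute change between adjacent frames.
    flux = (frames[1:] - frames[:-1]).abs().mean(dim=(1, 2, 3))  # shape (T-1,)

    dynamics = flux.mean()   # higher when the clip actually moves
    coherence = -flux.std()  # steadier flux across the clip scores better
    return dynamics + coherence
```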

Abstract

Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in post-training of video diffusion models. To address this, we introduce pixel-motion rewards based on pixel flux dynamics, capturing both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses standard supervised fine-tuning and advantage-weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves dynamic-degree collapse during supervised fine-tuning of modern video diffusion models.
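To make the hybrid objective concrete, here is a minimal sketch of how a per-sample denoising loss can be blended with an advantage-based weight so that zero advantages recover plain supervised fine-tuning. This is a generic advantage-weighted formulation under assumed names (`hybrid_finetune_loss`, `beta`, the exponential weighting and its clamp); it is not the paper's exact SHIFT objective, and the adversarial estimation of the advantages is not shown.

```python
import torch
import torch.nn.functional as F

def hybrid_finetune_loss(
    pred: torch.Tensor,        # model output for a noised video batch, shape (B, ...)
    target: torch.Tensor,      # standard diffusion target (e.g., the added noise)
    advantages: torch.Tensor,  # per-sample advantage, shape (B,), e.g., reward - baseline
    beta: float = 1.0,         # temperature on the advantage weighting
) -> torch.Tensor:
    """Illustrative blend of supervised and advantage-weighted fine-tuning.

    Each sample's denoising loss is scaled by exp(beta * advantage):
    with all-zero advantages this reduces to ordinary supervised
    fine-tuning, while positive advantages up-weight high-reward samples.
    """
    per_sample = F.mse_loss(pred, target, reduction="none")
    per_sample = per_sample.mean(dim=tuple(range(1, per_sample.dim())))  # -> (B,)

    # Cap the weights so a few extreme rewards cannot dominate the update,
    # a common guard against reward hacking in weighted fine-tuning.
    weights = torch.exp(beta * advantages).clamp(max=20.0)
    return (weights.detach() * per_sample).mean()
```

Detaching the weights keeps gradients from flowing through the reward term, so the update stays a weighted regression on the denoising loss rather than a policy-gradient step.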