SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

arXiv cs.LG / 4/15/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • In diffusion-model post-training, SFT (supervised fine-tuning) optimizes the denoiser only on states drawn from the correct forward noising process; once inference deviates from those states, exposure bias accumulates and no corrective behavior has been learned — a fundamental gap.
  • The proposed method, SOAR, performs an on-policy, reward-free bias correction: starting from a real sample, it rolls out the model once to obtain an off-trajectory state, re-noises that state, and trains the model to steer back toward the original clean target.
  • SOAR needs neither a reward model nor a reward signal, and its dense per-timestep supervision sidesteps the credit-assignment problem.
  • On SD3.5-Medium, GenEval and OCR improve substantially over SFT, and model-based preference scores also rise across the board, demonstrating alignment gains without RL.
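The rollout → re-noise → self-correct loop in the Key Points can be sketched as a single training step. This is a hypothetical toy sketch, not the authors' code: the function `soar_loss`, the toy network `ToyVelocityNet`, and the rectified-flow (velocity) parameterization are all assumptions made for illustration (SD3.5 is a rectified-flow model, but the paper's exact parameterization and schedules are not given here).

```python
import torch
import torch.nn.functional as F

def soar_loss(model, x0, sigma_roll, sigma_sup):
    """One SOAR-style training step (illustrative sketch, not the paper's code).

    1) Forward-noise the clean sample x0 to noise level sigma_roll.
    2) Take one denoising step with the current model under stop-gradient,
       yielding an off-trajectory estimate x_hat (on-policy, reward-free).
    3) Re-noise x_hat to the supervision level sigma_sup.
    4) Supervise the model's velocity to point back to the ORIGINAL clean
       target x0, giving a dense per-timestep loss.
    """
    eps = torch.randn_like(x0)
    # rectified-flow interpolation: x_s = (1 - s) * x0 + s * eps
    x_t = (1 - sigma_roll) * x0 + sigma_roll * eps

    with torch.no_grad():  # stop-gradient rollout with the current model
        v = model(x_t, sigma_roll)      # predicted velocity (target: eps - x0)
        x_hat = x_t - sigma_roll * v    # one-step estimate of the clean sample

    # re-noise the off-trajectory estimate to the supervision level
    eps2 = torch.randn_like(x0)
    x_sup = (1 - sigma_sup) * x_hat + sigma_sup * eps2

    # straight-line velocity from the original clean x0 to x_sup; when the
    # rollout is exact (x_hat == x0) this reduces to the SFT target eps2 - x0
    target = (x_sup - x0) / sigma_sup
    pred = model(x_sup, sigma_sup)
    return F.mse_loss(pred, target)

class ToyVelocityNet(torch.nn.Module):
    """Stand-in velocity predictor conditioned on the noise level."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim + 1, dim)

    def forward(self, x, sigma):
        s = torch.full_like(x[:, :1], float(sigma))
        return self.net(torch.cat([x, s], dim=-1))

model = ToyVelocityNet()
x0 = torch.randn(4, 8)                       # stand-in batch of "real samples"
loss = soar_loss(model, x0, sigma_roll=0.6, sigma_sup=0.4)
loss.backward()  # gradients flow only through the supervised correction step
```

Note that the rollout is wrapped in `torch.no_grad()`, matching the abstract's "single stop-gradient rollout": the model is trained only on the correction from the re-noised off-trajectory state, never through the rollout itself.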

Abstract

The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR's base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.
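The abstract's claim that SOAR's base loss subsumes the standard SFT objective can be made concrete under a rectified-flow parameterization (an assumption here, chosen because SD3.5 is a rectified-flow model). Writing the re-noised off-trajectory state and the SOAR supervision target as

$$
x_s = (1-s)\,\hat{x} + s\,\epsilon', \qquad v^\star(x_s, s) = \frac{x_s - x_0}{s},
$$

where $\hat{x}$ is the one-step rollout estimate and $x_0$ the original clean target, an exact rollout $\hat{x} = x_0$ gives $x_s = (1-s)\,x_0 + s\,\epsilon'$ and hence $v^\star = \epsilon' - x_0$, which is precisely the standard flow-matching/SFT velocity target. The self-correction loss therefore reduces to SFT in the no-error limit, consistent with SOAR replacing SFT as a stronger first post-training stage.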