CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration

arXiv cs.CV / 3/24/2026


Key Points

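  • CTCal identifies the conventional diffusion loss as the main cause of the problem: it cannot provide sufficiently explicit supervision for the fine-grained correspondence between text and the generated image.
  • It takes the reliable cross-attention maps (text-image alignment) formed at small timesteps, where noise is low, and uses them to calibrate representation learning at large timesteps, which provides explicit alignment supervision during training.
  • For use alongside the diffusion loss, CTCal also proposes a timestep-aware adaptive weighting so that the two objectives integrate consistently.
  • Experiments on T2I-CompBench++ and GenEval show that the method is model-agnostic and general, and can be integrated into models ranging from diffusion-based ones such as SD 2.1 to flow-based ones such as SD 3.
  • The implementation code is available on GitHub, and the authors emphasize that it can be applied easily to existing text-to-image generation models.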
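
To make the "less noise at small timesteps" point concrete, the following minimal sketch computes the signal-to-noise ratio of a noised sample under a DDPM-style linear beta schedule. The schedule values (1000 steps, beta from 1e-4 to 0.02) and the function names are assumptions chosen for illustration; the summary above does not specify which schedule CTCal uses.

```python
def alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    values, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        values.append(prod)
    return values


def snr(abar_t):
    """Signal-to-noise ratio of x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    return abar_t / (1.0 - abar_t)


abar = alpha_bar()
# Small timesteps keep most of the signal; large timesteps are dominated by noise.
for t in (10, 500, 990):
    print(t, snr(abar[t]))
```

Because the cumulative product only shrinks as the timestep grows, the ratio falls monotonically, which is why attention maps computed at small timesteps are the more trustworthy ones.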

Abstract

Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of the conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), motivated by the observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to integrate CTCal with the diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on the T2I-CompBench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at https://github.com/xiefan-guo/ctcal.
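
The calibration idea in the abstract can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the mean-squared-error form of the calibration term, the linear `timestep_weight`, and all function names are hypothetical, since the abstract does not give the exact loss or weighting function. The only parts taken from the source are that the small-timestep attention map serves as the target for the large-timestep one, and that the calibration term is combined with the diffusion loss using a timestep-dependent weight.

```python
def calibration_loss(attn_large_t, attn_small_t):
    """Mean squared error between two cross-attention maps (tokens x positions).

    attn_small_t is treated as a fixed target; in a real training loop it would
    presumably be excluded from gradient computation (assumption).
    """
    total, count = 0.0, 0
    for row_large, row_small in zip(attn_large_t, attn_small_t):
        for a, b in zip(row_large, row_small):
            total += (a - b) ** 2
            count += 1
    return total / count


def timestep_weight(t_large, num_steps=1000):
    """Hypothetical weighting: grows linearly with the noisier timestep."""
    return t_large / num_steps


def total_loss(diffusion_loss, attn_large_t, attn_small_t, t_large, num_steps=1000):
    """Diffusion loss plus the timestep-weighted calibration term."""
    weight = timestep_weight(t_large, num_steps)
    return diffusion_loss + weight * calibration_loss(attn_large_t, attn_small_t)


# Toy maps: two text tokens attending over four image positions.
attn_small = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]          # sharp, low noise
attn_large = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]  # diffuse, high noise
loss = total_loss(0.5, attn_large, attn_small, t_large=800)
print(loss)
```

In this toy setup the diffuse large-timestep map is penalized for deviating from the sharper small-timestep map, and the penalty counts for more as the timestep gets noisier.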