Self-Corrected Image Generation with Explainable Latent Rewards

arXiv cs.AI / 3/27/2026



Key Points

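Existing text-to-image generation struggles to align fine-grained semantics and spatial relations with complex prompts. The paper attributes this to the feed-forward nature of generation, which has to anticipate alignment before the output exists.
The proposed method, xLARD, is a framework that self-corrects during generation using a multimodal LLM and "Explainable Latent Rewards": a lightweight corrector updates latent representations based on structured feedback.
Image-level evaluations tend to be non-differentiable, but xLARD introduces a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance even from non-differentiable evaluations.
Experiments across diverse generation and editing tasks report improved semantic alignment and visual fidelity while maintaining generative priors.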

Abstract

Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
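
The idea of turning a non-differentiable evaluation into continuous latent guidance can be illustrated with a toy sketch. This is not the xLARD corrector: it swaps in a local linear surrogate fitted by least squares, and replaces the multimodal LLM evaluator with a quantized distance score. All names here (`blackbox_score`, `fit_linear_surrogate`, `correct_latent`, the 4-dimensional latent, the `target` vector) are hypothetical and chosen only for illustration.

```python
import random


def blackbox_score(z, target):
    """Hypothetical stand-in for a non-differentiable image-level evaluation.

    The score is quantized to steps of 0.1, so it has no useful gradient.
    """
    dist2 = sum((a - b) ** 2 for a, b in zip(z, target))
    return -round(dist2 * 10) / 10


def fit_linear_surrogate(z, target, rng, n_samples=32, sigma=0.3,
                         steps=150, lr=0.5):
    """Fit a differentiable surrogate r(delta) = w . delta + b.

    The surrogate maps small latent edits (delta) to predicted scores,
    using black-box scores of randomly perturbed latents as training data.
    """
    d = len(z)
    deltas = [[rng.gauss(0.0, sigma) for _ in range(d)]
              for _ in range(n_samples)]
    scores = [blackbox_score([a + e for a, e in zip(z, dl)], target)
              for dl in deltas]
    w, b = [0.0] * d, 0.0
    for _ in range(steps):  # plain gradient descent on mean squared error
        gw, gb = [0.0] * d, 0.0
        for dl, s in zip(deltas, scores):
            err = sum(wi * di for wi, di in zip(w, dl)) + b - s
            for i in range(d):
                gw[i] += err * dl[i]
            gb += err
        w = [wi - lr * g / n_samples for wi, g in zip(w, gw)]
        b -= lr * gb / n_samples
    return w, b


def correct_latent(z, target, rounds=30, step=0.1, seed=0):
    """Repeatedly refit the surrogate and move the latent along its gradient."""
    rng = random.Random(seed)
    z = list(z)
    for _ in range(rounds):
        w, _ = fit_linear_surrogate(z, target, rng)
        # For a linear surrogate, the gradient with respect to delta is w.
        z = [a + step * wi for a, wi in zip(z, w)]
    return z


if __name__ == "__main__":
    target = [1.0, -0.5, 0.25, 2.0]   # hypothetical "well-aligned" latent
    z0 = [0.0, 0.0, 0.0, 0.0]
    z1 = correct_latent(z0, target)
    print("score before:", blackbox_score(z0, target))
    print("score after: ", blackbox_score(z1, target))
```

The point of the sketch is the division of labor: the evaluator is only ever queried for scores, while the update direction comes from a separate differentiable model of how latent edits change those scores. According to the abstract, xLARD's feedback is structured and interpretable rather than a single scalar, which this sketch does not attempt to reproduce.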