Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

arXiv cs.CV / 4/16/2026

Key Points

  • Unified Multimodal Models (UMMs) can understand far better than they generate, suggesting their internal knowledge is not fully activated during generation.
  • The paper introduces UniRect-CoT, a training-free “reflective rectification” chain-of-thought approach that iteratively reflects during generation to activate inherent understanding and correct intermediate outputs.
  • It treats the UMM diffusion denoising process as intrinsic visual reasoning and uses alignment of intermediate results with the target instruction as a self-supervisory signal for generation rectification.
  • Experiments indicate UniRect-CoT can be plugged into existing UMMs and yields substantial improvements in generation quality across a variety of complex tasks.
  • Overall, the work frames UMMs’ existing understanding capability as a “free lunch”: reflective correction narrows the understanding–generation gap without additional training.

Abstract

Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human “Thinking-While-Drawing” paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the “free lunch” hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.
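
The abstract does not give implementation details, so the following is only a toy sketch of the general idea of rectifying intermediate denoising results with an alignment signal. It is not the paper's method. Every name here (`alignment_score`, `denoise_step`, `rectify`, `generate`, the `prior` vector, and the step sizes) is a hypothetical stand-in: the "understanding" signal is modelled as a negative squared distance to a target vector, and the "denoiser" as a pull toward an imperfect prior.

```python
import random


def alignment_score(latent, target):
    """Hypothetical stand-in for the understanding signal: higher is better."""
    return -sum((x - t) ** 2 for x, t in zip(latent, target))


def denoise_step(latent, prior, rate=0.3):
    """Toy denoiser (assumption): pulls the latent toward an imperfect prior."""
    return [x + rate * (p - x) for x, p in zip(latent, prior)]


def rectify(latent, target, step_size=0.2):
    """One gradient-ascent step on alignment_score with respect to the latent."""
    # The derivative of -(x - t)^2 with respect to x is 2 * (t - x).
    return [x + step_size * 2.0 * (t - x) for x, t in zip(latent, target)]


def generate(target, prior, steps=10, use_rectification=True, seed=0):
    """Run the toy denoising loop, optionally rectifying after every step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    for _ in range(steps):
        latent = denoise_step(latent, prior)
        if use_rectification:
            latent = rectify(latent, target)
    return latent


# Illustrative values only: the prior deliberately disagrees with the target.
target = [1.0, -1.0, 0.5]
prior = [0.6, -0.4, 0.9]

baseline = generate(target, prior, use_rectification=False)
rectified = generate(target, prior, use_rectification=True)

print("baseline alignment: ", alignment_score(baseline, target))
print("rectified alignment:", alignment_score(rectified, target))
```

In this toy setting no parameters are trained: the rectified run ends closer to the target than the baseline only because each intermediate result is nudged by the alignment signal. That mirrors the training-free framing of the abstract, under the simplifying assumptions stated above.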