Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

arXiv cs.CV / 4/16/2026


Key Points

  • The paper proposes Fine-grained Multimodal Reasoning (FiMR) to improve text-to-image generation by performing fine-grained, test-time self-reasoning rather than relying on coarse holistic alignment checks.
  • FiMR decomposes prompts into minimal semantic units (e.g., entities and attributes), verifies each unit using decomposed VQA, and produces explicit fine-grained feedback for each prompt component.
  • Using this feedback, the framework applies targeted, localized prompt refinements to better align generated images with the detailed attributes in the input text.
  • Experiments on compositional text-to-image benchmarks show FiMR consistently outperforms multiple baselines, including other reasoning-based methods.
  • The work focuses on enhancing control and alignment precision for unified multimodal LLMs that jointly understand and generate images, an area the authors note is underexplored.
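The decompose-verify-refine loop described in the key points can be sketched at test time as follows. This is a hypothetical minimal sketch, not the authors' implementation: the `decompose`, `refine_prompt`, and `fimr_loop` names are invented here, and the toy decomposer only handles simple "a \<attribute\> \<noun\>" phrases, whereas the paper uses the unified MLLM itself for decomposition, VQA verification, and image generation.

```python
from dataclasses import dataclass


@dataclass
class SemanticUnit:
    """A hypothetical minimal semantic unit: one entity plus one attribute."""
    subject: str    # e.g. "cat"
    attribute: str  # e.g. "blue"
    question: str   # VQA question used to verify this unit in the image


def decompose(prompt: str) -> list[SemanticUnit]:
    """Toy decomposition of phrases like 'a blue cat and a red dog'.
    A real system would prompt the MLLM to extract these units."""
    units = []
    for phrase in prompt.split(" and "):
        words = phrase.replace("a ", "").split()
        if len(words) >= 2:
            attr, subj = words[0], words[-1]
            units.append(SemanticUnit(subj, attr, f"Is the {subj} {attr}?"))
    return units


def refine_prompt(prompt: str, failed: list[SemanticUnit]) -> str:
    """Localized refinement: emphasize only the units that failed VQA,
    leaving the rest of the prompt untouched."""
    for u in failed:
        prompt += f" The {u.subject} must be {u.attribute}."
    return prompt


def fimr_loop(prompt, generate, vqa, max_rounds=3):
    """Test-time generate -> verify -> refine loop in the spirit of FiMR.
    `generate(prompt) -> image` and `vqa(image, question) -> bool` are
    placeholders for the unified MLLM's generation and VQA abilities."""
    units = decompose(prompt)
    image = generate(prompt)
    for _ in range(max_rounds):
        failed = [u for u in units if not vqa(image, u.question)]
        if not failed:          # every fine-grained unit verified: stop
            break
        prompt = refine_prompt(prompt, failed)
        image = generate(prompt)
    return image, prompt
```

The key difference from holistic reflection is visible in the loop: feedback is per-unit (`failed`), so only the components that fail verification trigger a refinement, rather than regenerating from a single global alignment score.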

Abstract

With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. Therefore, we propose Fine-grained Multimodal Reasoning (FiMR), a framework that leverages decomposed visual question answering (VQA) to break down an input prompt into minimal semantic units, such as entities and attributes, and verify each unit via VQA to generate explicit, fine-grained feedback. Based on this feedback, FiMR then applies targeted, localized refinements. This fine-grained self-reasoning and self-refinement enable MLLMs to achieve more precise improvements in image-prompt alignment and overall generation quality at test time. Extensive experiments demonstrate that FiMR consistently outperforms image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks.