DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

arXiv cs.CV / April 29, 2026


Key Points

  • The paper introduces DDA-Thinker, a framework that separates the “Thinker” planning module from a fixed “Editor” generative model to better evaluate and optimize reasoning-driven image editing.
  • It uses dual-atomic reinforcement learning that splits feedback into two verifiable checklist-based rewards: a cognitive-atomic reward for the quality of the executable plan and a visual-atomic reward for the final image quality (see the sketch after this list).
  • The checklist synthesis is improved by incorporating not only the source image and user instruction but also a rational reference description of the ideal post-edit scene.
  • A two-stage data curation pipeline is proposed to build a diverse, reasoning-focused dataset and then refine it with difficulty-aware filtering to create an effective reinforcement learning curriculum.
  • Experiments on RISE-Bench and KRIS-Bench show substantial gains, with a community model reaching performance competitive with proprietary models under a fixed-editor paradigm.
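To make the dual-atomic reward concrete, here is a minimal sketch of how two checklist-based reward signals could be combined, assuming each checklist item is a yes/no criterion judged by some verifier. The `ChecklistItem` type, the pass-rate scoring, and the `alpha` mixing weight are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    """One verifiable yes/no criterion produced by checklist synthesis."""
    question: str
    passed: bool  # in the real pipeline, judged by a verifier model

def atomic_reward(checklist: list[ChecklistItem]) -> float:
    """Score one 'atomic' reward as the fraction of checklist items satisfied."""
    if not checklist:
        return 0.0
    return sum(item.passed for item in checklist) / len(checklist)

def dual_atomic_reward(cognitive: list[ChecklistItem],
                       visual: list[ChecklistItem],
                       alpha: float = 0.5) -> float:
    """Combine the plan-level (cognitive) and image-level (visual) rewards.

    `alpha` is a hypothetical mixing weight; the paper may aggregate the
    two signals differently (e.g., as separate advantage terms).
    """
    return alpha * atomic_reward(cognitive) + (1.0 - alpha) * atomic_reward(visual)
```

Because both rewards are computed from discrete, verifiable checks rather than a single scalar preference score, the Thinker can receive separate credit for producing a sound plan and for that plan yielding a faithful edit.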

Abstract

Recent image editing models achieve strong visual fidelity but often struggle with tasks that require complex reasoning. To investigate and enhance reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed to independently optimize a planning module (Thinker) over a fixed generative model (Editor). This decoupled paradigm enables controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To guide the Thinker effectively, we introduce a dual-atomic reinforcement learning framework that decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward that directly assesses the quality of the Thinker's executable plan, the actionable outcome of its reasoning, and a visual-atomic reward that assesses the final image quality. To improve checklist quality, checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse, reasoning-focused dataset and then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance and enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.
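As a rough illustration of the difficulty-aware refinement stage, the sketch below keeps only editing tasks whose empirical success rate under the current Thinker and the fixed Editor falls in a mid-difficulty band, the regime where reinforcement learning signal is most informative. Every interface here (`thinker.plan`, `editor.edit`, `judge.passes_checklist`) and the band thresholds are hypothetical stand-ins, not the paper's concrete criteria.

```python
def difficulty_aware_filter(tasks, thinker, editor, judge,
                            n_rollouts: int = 8,
                            keep_band: tuple[float, float] = (0.1, 0.9)):
    """Curate an RL curriculum by empirical task difficulty.

    Tasks the fixed Editor solves every time (too easy) or never solves
    (too hard) yield near-zero learning signal, so both extremes are
    dropped. All object interfaces below are hypothetical.
    """
    curated = []
    for task in tasks:
        successes = 0
        for _ in range(n_rollouts):
            plan = thinker.plan(task.image, task.instruction)  # reasoned, executable plan
            edited = editor.edit(task.image, plan)             # fixed generative model
            successes += judge.passes_checklist(edited, task.checklist)
        rate = successes / n_rollouts
        if keep_band[0] <= rate <= keep_band[1]:
            curated.append(task)
    return curated
```

Filtering by rollout success rate is one common realization of difficulty-aware curation; the paper's two-stage pipeline could equally rank or re-weight samples rather than hard-filter them.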