GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing

arXiv cs.CV / March 24, 2026


Key Points

  • The paper addresses a key limitation of Diffusion LLMs (DLLMs): achieving precise, training-free image editing is difficult because discrete tokenization breaks standard noise inversion approaches and can degrade image structure.
  • It proposes GIDE (Grounded Inversion for DLLM Image Editing), introducing a discrete noise inversion mechanism and a three-stage pipeline (grounding, inversion, refinement) to enable higher-fidelity reconstruction and stricter background preservation.
  • GIDE is designed to support multiple instruction types for editing, including text prompts as well as point- and box-based guidance, while maintaining the unedited background.
  • The authors introduce GIDE-Bench, a benchmark with 805 compositional editing scenarios across diverse multi-modal inputs, and report large gains over prior training-free methods (Semantic Correctness +51.83%, Perceptual Quality +50.39%).
  • Additional tests on ImgEdit-Bench show consistent improvements over trained baselines and photorealistic quality comparable to leading models, suggesting broader applicability of the method.

Abstract

While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE to support various editing instructions (text, point, and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench show that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.
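The grounding → inversion → refinement decomposition can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the integer "image tokens", and the box-mask instruction are all hypothetical stand-ins, and a simple copy of the original tokens stands in for the actual Discrete Noise Inversion mechanism. It only shows how the staged design lets an edit touch the masked region while the background is reproduced token-for-token.

```python
# Hypothetical sketch of a grounding -> inversion -> refinement pipeline.
# Toy integer "image tokens" stand in for a real DLLM's discrete codes;
# none of these names or signatures come from the GIDE paper.
from typing import List


def ground(tokens: List[int], box: range) -> List[bool]:
    """Grounding: turn an instruction (here, a token-index box) into an edit mask."""
    return [i in box for i in range(len(tokens))]


def invert(tokens: List[int]) -> List[int]:
    """'Inversion': record the original discrete tokens so the unedited
    region can be reconstructed exactly (a stand-in for the paper's
    discrete noise inversion, which recovers latent noise instead)."""
    return list(tokens)


def refine(inverted: List[int], mask: List[bool], edit_token: int) -> List[int]:
    """Refinement: resample tokens inside the mask; copy originals outside it."""
    return [edit_token if m else t for t, m in zip(inverted, mask)]


def edit(tokens: List[int], box: range, edit_token: int) -> List[int]:
    mask = ground(tokens, box)          # stage 1: where to edit
    inverted = invert(tokens)           # stage 2: what to preserve
    return refine(inverted, mask, edit_token)  # stage 3: apply the edit


image = [3, 1, 4, 1, 5, 9, 2, 6]
edited = edit(image, range(2, 5), 7)
print(edited)  # tokens outside indices 2..4 are preserved exactly
```

The same skeleton would accommodate point or text instructions by swapping the `ground` stage; the background-preservation guarantee lives entirely in `refine` copying the inverted tokens outside the mask.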