From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

arXiv cs.CV / 4/20/2026


Key Points

  • The paper argues that many training-free, text-guided image editing methods follow a competitive setup where editing and reconstruction branches optimize separate prompt objectives, leading to semantic conflicts and unstable results.
  • It introduces CoEdit, a zero-shot framework that reframes attention control as “coopetitive negotiation” between branches to coordinate editing decisions across spatial regions and time steps.
  • Spatially, CoEdit uses Dual-Entropy Attention Manipulation to model directional entropic interactions between branches and convert attention control into a harmony-maximization problem for better localization of editable versus preservable areas.
  • Temporally, it proposes Entropic Latent Refinement to adjust latent states during denoising, reducing accumulated editing errors and improving consistency of semantic transitions over the denoising trajectory.
  • Experiments on standard benchmarks show improved editing quality and stronger structural/background preservation; the paper also introduces a Fidelity-Constrained Editing Score, a metric that jointly measures semantic change and fidelity. Code is planned for release on GitHub.
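The paper's exact formulation of Dual-Entropy Attention Manipulation is not reproduced here, but the core idea in the key points (using entropy of each branch's attention to decide, per spatial location, how much the editing branch should defer to the reconstruction branch) can be sketched. Everything below — the sigmoid weighting, the `tau` temperature, and the function names — is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy of each query's attention distribution.
    attn: (num_queries, num_keys), rows sum to 1.
    High entropy = diffuse/uncertain attention; low = focused."""
    p = np.clip(attn, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)  # shape: (num_queries,)

def coopetitive_blend(attn_edit, attn_recon, tau=0.5):
    """Hypothetical coopetitive negotiation between branches:
    where the editing branch's attention is higher-entropy than the
    reconstruction branch's (i.e., less certain), lean on the
    reconstruction branch to preserve structure; where it is more
    focused, keep the edit. `tau` is an assumed mixing temperature."""
    h_edit = attention_entropy(attn_edit)
    h_recon = attention_entropy(attn_recon)
    # Per-query weight in [0, 1]: nearer 1 when the edit branch
    # is more "certain" (lower entropy) than the recon branch.
    w = 1.0 / (1.0 + np.exp((h_edit - h_recon) / tau))
    return w[:, None] * attn_edit + (1.0 - w[:, None]) * attn_recon
```

Because the blend is a per-row convex combination of two valid attention maps, the result remains a valid attention map (rows still sum to 1), which is why this kind of spatial negotiation can be dropped into a cross-attention layer without renormalization.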

Abstract

Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite this progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with the target and source prompts. This adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, ultimately improving the localization of editable and preservable regions. Temporally, we present an Entropic Latent Refinement mechanism that dynamically adjusts latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at https://github.com/JinhaoShen/CoEdit.
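The abstract's temporal component, Entropic Latent Refinement, adjusts latent states step by step during denoising. The paper's schedule and update rule are not given here, so the sketch below is purely illustrative: it nudges the editing branch's latent toward the reconstruction branch's latent with a strength that decays over the trajectory (a linear schedule and the `gamma` strength are assumptions, as are all names):

```python
import numpy as np

def refine_latents(z_edit, z_recon, t, num_steps, gamma=0.3):
    """Hypothetical per-step latent refinement: pull the editing
    branch's latent toward the reconstruction branch's latent early
    in denoising (when global structure is being decided), fading
    to no correction by the final step. This is one plausible way
    to damp accumulated editing error, not the paper's exact rule."""
    alpha = gamma * (1.0 - t / max(num_steps - 1, 1))  # decays to 0
    return (1.0 - alpha) * z_edit + alpha * z_recon
```

At `t = num_steps - 1` the correction vanishes and the edited latent passes through unchanged, so the schedule only intervenes while structural content is still malleable.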