AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT

arXiv cs.CV / 5/5/2026


Key Points

  • The paper studies training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that uses a single attention stream mixing noise and source-image tokens.
  • It introduces KVInject, a single-forward key/value (KV) injection method that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band, outperforming the prior two-pass MasaCtrl recipe while avoiding the prompt-mismatch failure that disables MasaCtrl on MMDiT (a minimal sketch follows this list).
  • The authors find that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches each edit to the attention manipulation that best preserves source structure for its category (sketched after this list).
  • Using ground-truth edit categories, AttnRouter boosts a CLIP-T+DINO-I composite score by 6.4% over a baseline, and an automatic CLIP zero-shot classifier recovers 98% of the gain despite only 55% category accuracy.
  • Ablations localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) nearly matches full-step gains, while injection in early (L0-15) or late (L45-60) layer bands fails to drive editing and naive K/V rescaling never beats the baseline; the authors release code, pre-computed routing tables, and the 100-sample stratified ImgEdit-Bench subset used in the ablations.
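
A minimal sketch of the KVInject idea, assuming a PyTorch attention layer whose sequence is laid out as [noise tokens | source-image tokens] of equal length; the function names, tensor layout, and the mid-layer band default are illustrative assumptions, not the paper's released implementation:

```python
import torch


def kv_inject(k: torch.Tensor, v: torch.Tensor, n_noise: int,
              alpha: float = 0.4) -> tuple[torch.Tensor, torch.Tensor]:
    """Alpha-blend the source-half K/V projections into the noise-half.

    k, v: (batch, heads, seq, head_dim), where positions [0, n_noise) are
    noise tokens and [n_noise, 2*n_noise) are source-image tokens (an
    assumed layout). alpha in [0.3, 0.5] is the paper's reported sweet spot.
    """
    k, v = k.clone(), v.clone()
    src = slice(n_noise, 2 * n_noise)
    k[..., :n_noise, :] = (1 - alpha) * k[..., :n_noise, :] + alpha * k[..., src, :]
    v[..., :n_noise, :] = (1 - alpha) * v[..., :n_noise, :] + alpha * v[..., src, :]
    return k, v


def banded_kv_inject(k, v, n_noise, layer_idx, step_idx, alpha=0.4,
                     step_band=range(0, 8), layer_band=range(16, 45)):
    """Apply injection only inside a localized layer/step band.

    step_band=range(0, 8) mirrors the early-step S0-7 finding; the mid-layer
    band is an assumption consistent with the reported failure of the early
    (L0-15) and late (L45-60) bands.
    """
    if step_idx in step_band and layer_idx in layer_band:
        return kv_inject(k, v, n_noise, alpha)
    return k, v
```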

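The routing step itself reduces to a lookup from edit category to an attention-operation configuration. A minimal sketch, assuming hypothetical category names and settings; the authors' released pre-computed routing tables are the authoritative mapping:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnOp:
    """One attention manipulation plus its hyperparameters."""
    name: str              # e.g. "kv_inject", or "baseline" for no manipulation
    alpha: float = 0.0
    step_band: tuple = ()  # denoising steps where the op is active


# Hypothetical table: each edit category is dispatched to the operation
# that best preserves source structure for that category.
ROUTING_TABLE = {
    "style":      AttnOp("kv_inject", alpha=0.5, step_band=tuple(range(8))),
    "remove":     AttnOp("kv_inject", alpha=0.3, step_band=tuple(range(8))),
    "background": AttnOp("baseline"),
}


def route(category: str) -> AttnOp:
    # Fall back to the plain editing baseline for unseen categories.
    return ROUTING_TABLE.get(category, AttnOp("baseline"))
```
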
Abstract

We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. (iii) Through layer-, step-, and alpha-band ablations we localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) recovers nearly all of the gain of full-step injection, while injection in early (L0-15) or late (L45-60) layer bands entirely fails to drive editing; alpha in [0.3, 0.5] is a stable sweet spot. We also report negative results that highlight what does not transfer from the UNet folklore: simple K/V rescaling never beats baseline and aggressive variants collapse generation entirely (composite 0.084). We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations.
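
Notably, the router does not need an accurate classifier: at 55% category accuracy the automatic predictor still recovers 98% of the oracle gain. A minimal sketch of one plausible zero-shot setup, assuming the edit category is predicted by cosine similarity between CLIP text embeddings of the instruction and short category-description prompts; the checkpoint, prompts, and category names are assumptions, not the paper's exact configuration:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_ID).eval()
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID)

# Hypothetical category prompts; the released code defines the real set.
CATEGORY_PROMPTS = {
    "style":      "an instruction to change the artistic style of an image",
    "remove":     "an instruction to remove an object from an image",
    "background": "an instruction to replace the background of an image",
}


@torch.no_grad()
def embed(texts):
    # Encode texts with CLIP's text tower and L2-normalize for cosine similarity.
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    feats = model.get_text_features(**tokens)
    return feats / feats.norm(dim=-1, keepdim=True)


@torch.no_grad()
def classify(instruction: str) -> str:
    # Pick the category whose prompt embedding is closest to the instruction.
    cats = list(CATEGORY_PROMPTS)
    sims = embed([instruction]) @ embed([CATEGORY_PROMPTS[c] for c in cats]).T
    return cats[sims.argmax().item()]
```

Under these assumed prompts, `route(classify("make it look like a watercolor painting"))` would select the style-edit operation from the routing table sketched above.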