PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

arXiv cs.CV / 5/4/2026


Key Points

  • The paper argues that instruction-following image editors need adaptive inference that varies both spatial coverage and reasoning depth, because different edits (e.g., color swap vs. physical-action changes) require different computation budgets.
  • It introduces PhysEdit, a region-aware image editing framework that adds two inference-time modules, Complexity-Adaptive Reasoning Depth (CARD) and a Spatial Reasoning Mask (SRM), without retraining the backbone.
  • CARD predicts per-sample edit complexity from the instruction and reference image and conditionally allocates the number of reasoning steps and reasoning token length, turning a fixed inference schedule into conditional computation.
  • SRM extracts an instruction-conditioned spatial prior from cross-attention and uses it to confine reasoning to the regions that semantically require it.
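  • On the full ImgEdit Basic-Edit Suite (737 cases), PhysEdit achieves a 1.18× wall-clock speedup over a strong reasoning baseline (64.3s vs. 76.1s per sample) while slightly improving instruction adherence (CLIP-T 0.2283 vs. 0.2266) and matching identity preservation within noise (CLIP-I 0.8246 vs. 0.8280). The speedup is category-dependent and reaches 1.52× on appearance-level edits.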
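The two modules described above can be illustrated with a minimal sketch. This is not the paper's implementation: the complexity score is taken as an input rather than predicted, and the tier boundaries, budget values, `keep_ratio`, and the names `ReasoningBudget`, `allocate_budget`, and `spatial_mask` are all hypothetical. It only shows the general shape of conditional computation (CARD-style) and attention-derived masking (SRM-style).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningBudget:
    steps: int   # reasoning step count (N_r in the abstract)
    tokens: int  # reasoning-token length (r in the abstract)


# Hypothetical tiers: (upper bound on complexity score, budget).
# The paper's actual allocation rule is not given in this summary.
BUDGET_TIERS = [
    (0.33, ReasoningBudget(steps=2, tokens=64)),
    (0.66, ReasoningBudget(steps=4, tokens=128)),
    (1.00, ReasoningBudget(steps=8, tokens=256)),
]


def allocate_budget(complexity: float) -> ReasoningBudget:
    """Map a complexity score to a budget; the score is clamped to [0, 1]."""
    score = min(max(complexity, 0.0), 1.0)
    for upper, budget in BUDGET_TIERS:
        if score <= upper:
            return budget
    return BUDGET_TIERS[-1][1]


def spatial_mask(attn, keep_ratio=0.25):
    """Binarize a 2D attention map given as a list of rows.

    Cells whose value is at least the k-th largest value are set to 1,
    where k is keep_ratio times the number of cells (at least 1).
    Ties at the threshold can keep slightly more than k cells.
    """
    flat = sorted((v for row in attn for v in row), reverse=True)
    k = max(1, round(keep_ratio * len(flat)))
    threshold = flat[k - 1]
    return [[1 if v >= threshold else 0 for v in row] for row in attn]
```

Under these assumed tiers, a simple edit scored at 0.1 would receive the smallest budget and a physical-action edit scored at 0.9 the largest, while `spatial_mask` would limit reasoning to the highest-attention cells of the map.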

Abstract

Image editing instructions are heterogeneous: a color swap, an object insertion, and a physical-action edit all demand different spatial coverage and different reasoning depth, yet existing reasoning-based editors apply a single fixed inference recipe to every instruction. We argue that adaptivity along both the spatial and temporal axes is the missing degree of freedom, and we present PhysEdit, an editing framework built around this principle. PhysEdit introduces two inference-time modules that compose without retraining the backbone. At its core, (1) Complexity-Adaptive Reasoning Depth (CARD) predicts edit complexity directly from the instruction and reference image and allocates the reasoning step count N_r and reasoning-token length r per sample -- turning a previously fixed inference schedule into a conditional-computation problem. CARD is supported by (2) a Spatial Reasoning Mask (SRM) that extracts an instruction-conditioned spatial prior from cross-attention to confine reasoning to regions that semantically require it. On the full 737-case ImgEdit Basic-Edit Suite, PhysEdit delivers a 1.18x wall-clock speedup (64.3s vs. 76.1s per sample) over a strong reasoning baseline while slightly improving instruction adherence (CLIP-T 0.2283 vs. 0.2266, +0.7%) and matching identity preservation within noise (CLIP-I 0.8246 vs. 0.8280). The speedup is category-dependent and reaches 1.52x on appearance-level edits, validating CARD's adaptive allocation as the principal source of efficiency gain. A 30-sample pilot with full ablations isolates the contribution of each module.
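The headline figures in the abstract can be re-derived from the reported raw numbers. The short check below uses only values quoted above; the helper names `speedup` and `relative_change_pct` are illustrative.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Wall-clock speedup factor of the new method over the baseline."""
    return baseline_seconds / new_seconds


def relative_change_pct(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# Values as reported in the abstract (per-sample seconds and CLIP scores).
wall_clock = speedup(76.1, 64.3)                     # about 1.18
clip_t_gain = relative_change_pct(0.2283, 0.2266)    # about +0.75 percent
clip_i_change = relative_change_pct(0.8246, 0.8280)  # about -0.41 percent
time_saved = relative_change_pct(64.3, 76.1)         # about -15.5 percent

print(f"speedup: {wall_clock:.2f}x")
print(f"CLIP-T change: {clip_t_gain:+.2f}%")
print(f"CLIP-I change: {clip_i_change:+.2f}%")
print(f"per-sample time change: {time_saved:+.1f}%")
```

The 1.18x speedup corresponds to roughly 15.5% less time per sample. The CLIP-T gain computed from the four-decimal scores is about 0.75%, consistent with the reported +0.7% given that the published scores are themselves rounded, and the CLIP-I difference is about 0.4% in the other direction, which the abstract treats as within noise.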