PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces PhyEdit, a physically grounded image editing framework for more precise real-world object manipulation; it addresses scaling and positioning failures caused by the absence of explicit 3D-geometry and perspective-projection mechanisms.
  • PhyEdit improves manipulation accuracy by using a plug-and-play explicit 3D prior with geometric simulation as 3D-aware guidance, combined with joint 2D–3D supervision.
  • The authors release RealManip-10K, a real-world dataset containing paired images and depth annotations to support 3D-aware object manipulation research and evaluation.
  • They also propose ManipEval, a benchmark with multi-dimensional metrics to assess 3D spatial control and geometric consistency.
  • Experiments indicate PhyEdit outperforms prior approaches, including strong closed-source models, on both 3D geometric accuracy and manipulation consistency.

Abstract

Achieving physically accurate object manipulation in image editing is essential for potential applications such as interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D–3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.
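To make the failure mode concrete: under perspective projection, an object's on-image size depends on its depth, so a 2D editor that moves an object without a depth model will scale it incorrectly. The sketch below is a minimal pinhole-camera illustration of that constraint; it is not the paper's implementation, and the function names are hypothetical.

```python
# Minimal pinhole-camera sketch (illustrative only, not PhyEdit's actual method):
# moving an object along the depth axis must rescale its projected size.

def project(point, f):
    """Perspective-project a 3D point (x, y, z) with focal length f."""
    x, y, z = point
    return (f * x / z, f * y / z)

def apparent_scale(z_old, z_new):
    """On-image scale factor when an object moves from depth z_old to z_new."""
    return z_old / z_new

# A point offset of 2 units at depth 4 projects to image coordinate 0.5.
print(project((2.0, 0.0, 4.0), f=1.0))  # → (0.5, 0.0)

# Halving the depth (4 → 2) doubles the apparent size.
print(apparent_scale(4.0, 2.0))  # → 2.0
```

A purely 2D edit that repositions an object without applying this depth-dependent rescaling produces the incorrect scaling the abstract describes, which is what an explicit 3D prior with geometric simulation is meant to prevent.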