View-Consistent 3D Scene Editing via Dual-Path Structural Correspondence and Semantic Continuity

arXiv cs.CV / 4/27/2026


Key Points

  • The paper targets a key limitation of text-driven 3D scene editing: maintaining cross-view consistency across the render-edit-optimize pipeline, in which multi-view images rendered from the scene are edited in 2D and then used to optimize the 3D representation (a minimal sketch of this loop follows the list).
  • It reframes consistent 3D editing as joint distribution modeling across viewpoints, explicitly injecting cross-view dependencies into the editing pipeline.
  • The proposed dual-path consistency mechanism uses projection-guided structural guidance and patch-level semantic propagation to improve both geometric/structural alignment and semantic continuity across views.
  • The authors build a paired multi-view editing dataset to provide reliable supervision for learning cross-view consistency, and report stronger results on complex scenes with more precise, consistent views.
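To make the pipeline concrete, here is a minimal PyTorch-style sketch of the render-edit-optimize loop. The `scene` object, its differentiable `render` method, the `edit_2d` editor, and the L1 loss are all hypothetical placeholders for illustration, not the paper's implementation; the sketch only shows where inconsistency enters, namely step 2, where each view is edited independently.

```python
# Hedged sketch of a render-edit-optimize loop; `scene` and `edit_2d`
# are hypothetical placeholders, not the authors' API.
import torch
import torch.nn.functional as F

def render_edit_optimize(scene, cameras, prompt, edit_2d, steps=1000):
    # 1. Render multi-view images from the current 3D representation.
    views = [scene.render(cam) for cam in cameras]

    # 2. Edit each view with a text-driven 2D editor. Editing views
    #    independently is where cross-view inconsistency enters: each
    #    view draws its own sample from the editor's output distribution.
    edited = [edit_2d(view, prompt).detach() for view in views]

    # 3. Optimize the 3D representation against the edited views.
    optimizer = torch.optim.Adam(scene.parameters(), lr=1e-3)
    for _ in range(steps):
        idx = torch.randint(len(cameras), (1,)).item()
        pred = scene.render(cameras[idx])
        loss = F.l1_loss(pred, edited[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return scene
```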

Abstract

Text-driven 3D scene editing has recently attracted increasing attention. Most existing methods follow a render-edit-optimize pipeline, in which multi-view images are rendered from a 3D scene, edited with 2D image editors, and then used to optimize the underlying 3D representation. However, cross-view inconsistency remains a major bottleneck. Although recent methods introduce geometric cues, cross-view interactions, or video priors to mitigate this issue, they still rely largely on inference-time synchronization and thus remain limited in robustness and generalization. In this work, we recast multi-view consistent 3D editing from a distributional perspective: 3D scene editing essentially requires modeling a joint distribution across viewpoints. Based on this insight, we propose a view-consistent 3D editing framework that explicitly introduces cross-view dependencies into the editing process. Motivated by the observation that structural correspondence and semantic continuity rely on different cross-view cues, we introduce a dual-path consistency mechanism consisting of projection-guided structural guidance and patch-level semantic propagation for effective cross-view editing. In addition, we construct a paired multi-view editing dataset that provides reliable supervision for learning cross-view consistency in edited scenes. Extensive experiments demonstrate that our method achieves superior editing performance, producing precise and consistent views for complex scenes.
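The dual-path mechanism can be read as two complementary cues. Below is a minimal, hedged PyTorch sketch of that reading: a structural path that reprojects an edited reference view into a target view via depth and camera geometry, and a semantic path that propagates patch features by cross-view similarity. Function names, tensor shapes, and the attention-style propagation are assumptions made for illustration, not the authors' actual method.

```python
# Hedged sketch of the two cross-view cues the abstract names.
# Shapes assumed: src_img (B, C, H, W), src_depth (H, W) or flattened,
# K a 3x3 intrinsic matrix, T_src_to_tgt a 4x4 relative pose.
import torch
import torch.nn.functional as F

def structural_guidance(src_img, src_depth, K, T_src_to_tgt):
    """Structural path (assumed form): back-project source pixels with
    depth, transform by the relative pose, and re-project into the
    target view, yielding a pixel-aligned structural guidance image."""
    B, _, H, W = src_img.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=src_img.dtype),
        torch.arange(W, dtype=src_img.dtype),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)
    cam = torch.linalg.inv(K) @ pix * src_depth.reshape(1, -1)  # 3D points
    cam_h = torch.cat([cam, torch.ones(1, H * W, dtype=src_img.dtype)])
    tgt = (T_src_to_tgt @ cam_h)[:3]                 # target camera frame
    uv = (K @ tgt)[:2] / tgt[2:].clamp(min=1e-6)     # re-projected pixels
    grid = uv.T.reshape(1, H, W, 2).clone()
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1    # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(src_img, grid.expand(B, -1, -1, -1),
                         align_corners=True)

def semantic_propagation(src_feats, tgt_feats):
    """Semantic path (assumed form): soft patch matching by cosine
    similarity, carrying source-view semantics into target patches
    that the geometric warp cannot reach (e.g. occlusions)."""
    src = F.normalize(src_feats.flatten(2), dim=1)   # (B, C, N_src)
    tgt = F.normalize(tgt_feats.flatten(2), dim=1)   # (B, C, N_tgt)
    attn = torch.einsum("bcn,bcm->bnm", tgt, src).softmax(dim=-1)
    out = torch.einsum("bnm,bcm->bcn", attn, src_feats.flatten(2))
    return out.reshape_as(tgt_feats)
```

The split mirrors the abstract's observation: reprojection gives exact geometric alignment where depth is valid, while patch-level propagation supplies semantic continuity in regions without a direct geometric correspondence.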