NavCrafter: Exploring 3D Scenes from a Single Image

arXiv cs.CV / 4/6/2026

Key Points

NavCrafter is presented as a framework for generating flexible 3D scenes from a single image by producing controllable novel-view video sequences while maintaining temporal-spatial consistency.
The method uses video diffusion models to learn rich 3D priors and applies a geometry-aware expansion strategy to progressively broaden scene coverage.
It introduces a multi-stage camera control mechanism (dual-branch camera injection plus attention modulation) to enable trajectory-conditioned, controllable multi-view synthesis.
The system includes collision-aware camera trajectory planning and an improved 3D Gaussian Splatting pipeline with depth-aligned supervision, structural regularization, and refinement.
Experiments reported in the abstract indicate state-of-the-art novel-view synthesis for large viewpoint changes and improved 3D reconstruction fidelity.
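
To make the idea of collision-aware trajectory planning concrete, here is a minimal sketch. The source does not describe how NavCrafter's planner works, so everything below is an assumption for illustration: the scene is a plain list of 3D points, the camera path is a list of waypoints, and the hypothetical helpers `min_clearance` and `filter_trajectory` cut the path at the first waypoint that comes closer to the geometry than a chosen `margin`.

```python
import math


def min_clearance(position, scene_points):
    """Smallest Euclidean distance from a camera position to any scene point."""
    return min(math.dist(position, p) for p in scene_points)


def filter_trajectory(waypoints, scene_points, margin=0.3):
    """Return the prefix of the path that stays at least `margin` from the scene.

    Illustrative assumption only: the path is truncated at the first waypoint
    whose clearance drops below the margin, so the camera never enters or
    passes through reconstructed geometry.
    """
    safe = []
    for waypoint in waypoints:
        if min_clearance(waypoint, scene_points) < margin:
            break  # stop before the first colliding waypoint
        safe.append(waypoint)
    return safe


if __name__ == "__main__":
    scene = [(0.0, 0.0, 2.0)]  # a single obstacle point in front of the camera
    path = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.8), (0.0, 0.0, 2.5)]
    print(filter_trajectory(path, scene))
```

A real planner would more plausibly re-route around obstacles than truncate the path, and would query a depth map or occupancy structure instead of a brute-force point scan. The truncation rule is chosen here only because it is the simplest behavior that respects the clearance constraint.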

Abstract

Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.
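
The abstract mentions depth-aligned supervision without saying how the alignment is done. One common approach, shown here purely as an assumption and not as the authors' method, is to fit a scale and shift that map a relative (monocular) depth prediction onto a reference depth by least squares, and then penalize the rendered depth against the aligned prediction. The function names `align_depth` and `depth_l1_loss` are hypothetical, and depths are flat Python lists for simplicity.

```python
def align_depth(pred, ref):
    """Closed-form least-squares fit of scale s and shift t.

    Minimizes sum((s * pred_i + t - ref_i) ** 2) over valid pixels.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("predicted depth is constant; scale is undetermined")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    s = cov / var_p
    t = mean_r - s * mean_p
    return s, t


def depth_l1_loss(rendered, pred, s, t):
    """Mean absolute error between rendered depth and the aligned prediction."""
    return sum(abs(d - (s * p + t)) for d, p in zip(rendered, pred)) / len(pred)


if __name__ == "__main__":
    pred = [1.0, 2.0, 3.0, 4.0]             # relative depth from a monocular model
    ref = [2.0 * p + 0.5 for p in pred]     # reference depth in scene units
    s, t = align_depth(pred, ref)
    print(s, t, depth_l1_loss(ref, pred, s, t))
```

In a 3DGS training loop, a term like this would typically be added to the photometric loss with a small weight, so that geometry is constrained in regions where novel views give weak color supervision.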