Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

arXiv cs.CV / 4/16/2026

Opinion, Ideas & Deep Analysis, Models & Research

Key Points

The article surveys feed-forward 3D scene modeling methods that reconstruct 3D representations from 2D inputs in a single forward pass, aiming to overcome the slow optimization and limited scalability of traditional per-scene approaches.
It argues that, despite different geometric output formats (e.g., implicit fields vs. explicit primitives), recent feed-forward methods often share common architectural patterns such as image feature backbones, multi-view fusion, and geometry-aware components.
The survey introduces a new, representation-agnostic taxonomy that organizes the research into five problem-driven directions: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware models.
To ground the taxonomy empirically, it reviews benchmarks and datasets, discusses standardized evaluation practices, and categorizes real-world applications of feed-forward 3D models.
It concludes by outlining open challenges and future directions, including scalability, stronger evaluation standards, and broader “world modeling” capabilities.

Abstract

Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders their practical deployment and scalability. As a result, generalizable feed-forward 3D reconstruction has developed rapidly in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. Our survey is motivated by a critical observation: despite their diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. Consequently, we abstract away from these representation differences and propose a novel taxonomy centered on model design strategies that are agnostic to the output format. The taxonomy organizes the field into five key problems that drive recent research: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware models. To support this taxonomy with empirical grounding and standardized evaluation, we also review related benchmarks and datasets, and discuss and categorize real-world applications of feed-forward 3D models. Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.
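
To make the shared architectural pattern concrete, the following is a minimal illustrative sketch in plain Python, not an implementation of any method covered by the survey. It assumes a toy setting in which each view is a flattened list of pixel intensities, the "backbone" is a single linear layer with a tanh, fusion is a plain average over views, and the head emits a small set of 3D points as a stand-in for an explicit representation. All function names, sizes, and the random (untrained) weights are hypothetical and chosen only to show the data flow of a single forward pass.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weights: Matrix) -> Vector:
    """Bias-free linear layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def encode_view(pixels: Vector, weights: Matrix) -> Vector:
    """Stand-in image backbone: linear layer plus tanh on one flattened view."""
    return [math.tanh(h) for h in linear(pixels, weights)]


def fuse_views(features: List[Vector]) -> Vector:
    """Stand-in multi-view fusion: element-wise mean over all view features."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def decode_points(fused: Vector, weights: Matrix, num_points: int) -> List[Vector]:
    """Stand-in output head: map the fused feature to num_points xyz triples."""
    flat = linear(fused, weights)
    return [flat[3 * i:3 * i + 3] for i in range(num_points)]


def reconstruct(views: List[Vector], enc_w: Matrix, dec_w: Matrix,
                num_points: int) -> List[Vector]:
    """One forward pass: encode each view, fuse, decode. No per-scene optimization."""
    features = [encode_view(view, enc_w) for view in views]
    fused = fuse_views(features)
    return decode_points(fused, dec_w, num_points)


def random_weights(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Untrained placeholder weights; a real model would learn these from data."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes: 16 pixels per view, 8 feature channels, 5 output points.
rng = random.Random(0)
PIXELS, FEATURES, POINTS = 16, 8, 5
enc_w = random_weights(FEATURES, PIXELS, rng)
dec_w = random_weights(3 * POINTS, FEATURES, rng)

views = [[rng.random() for _ in range(PIXELS)] for _ in range(3)]
points = reconstruct(views, enc_w, dec_w, POINTS)
print(len(points), "points, each with", len(points[0]), "coordinates")
```

Because the fusion step here is a mean, the same weights accept any number of input views in any order, which is one simple way to see why a feed-forward design can be reused across scenes. Swapping `decode_points` for a head that predicts, say, density values at query locations would change the output representation without touching the encoder or the fusion step, which is the kind of representation-agnostic view the taxonomy takes.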