Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting

arXiv cs.CV / 3/24/2026


Key Points

  • The paper presents 2Xplat, a pose-free feed-forward 3D Gaussian Splatting framework that separates geometry estimation from appearance (Gaussian) generation using a two-expert design rather than a single monolithic network.
  • A dedicated geometry expert predicts camera poses, and those poses are explicitly provided to an appearance expert that synthesizes the 3D Gaussian representation.
  • The authors report that the approach reaches strong results in fewer than 5K training iterations and significantly outperforms prior pose-free feed-forward 3DGS methods.
  • 2Xplat’s performance is said to be on par with state-of-the-art posed methods, suggesting modular architectures may be preferable to unified “all-in-one” designs for high-fidelity 3D reconstruction.
  • The work challenges the dominant entangled-architecture paradigm and motivates further exploration of decoupled, modular design principles for geometry-plus-appearance tasks.
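To make the decoupled design concrete, here is a minimal sketch of the two-expert interface the key points describe: a geometry expert maps uncalibrated views to camera poses, and those poses are passed explicitly to an appearance expert that emits 3D Gaussian parameters. All function names, shapes, and the 14-parameter Gaussian layout are hypothetical placeholders, not the authors' actual implementation.

```python
import numpy as np

def geometry_expert(images: np.ndarray) -> np.ndarray:
    """Stub geometry expert: maps V uncalibrated views to V camera poses.

    Returns identity 4x4 extrinsics as a placeholder for predicted poses.
    """
    num_views = images.shape[0]
    return np.stack([np.eye(4) for _ in range(num_views)])

def appearance_expert(images: np.ndarray, poses: np.ndarray) -> np.ndarray:
    """Stub appearance expert: consumes the images plus the *explicit*
    poses from the geometry expert and emits per-pixel 3D Gaussians
    (here: 14 placeholder values for mean, scale, rotation, opacity, colour).
    """
    v, h, w, _ = images.shape
    assert poses.shape == (v, 4, 4)  # poses are a hard input, not shared features
    return np.zeros((v * h * w, 14))

def two_xplat_forward(images: np.ndarray) -> np.ndarray:
    """Single feed-forward pass: geometry first, then appearance."""
    poses = geometry_expert(images)
    return appearance_expert(images, poses)

views = np.random.rand(4, 32, 32, 3)  # 4 uncalibrated RGB views
gaussians = two_xplat_forward(views)
print(gaussians.shape)                # one Gaussian per pixel across views
```

The point of the sketch is the interface boundary: unlike a monolithic network sharing one representation, the appearance expert here only sees geometry through the predicted poses, so each expert can be trained or swapped independently.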

Abstract

Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, and although largely underexplored in prior work, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-expert pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.