TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

arXiv cs.CV / 4/17/2026

Key Points

TokenGS proposes improving feed-forward 3D Gaussian Splatting prediction by directly regressing 3D Gaussian mean coordinates rather than regressing depth along camera rays.
The method introduces an encoder-decoder design with learnable Gaussian tokens, decoupling the number of predicted 3D primitives from the input image resolution and the number of views.
Training uses only a self-supervised rendering loss; the authors argue that the usual depth-along-ray parameterization of Gaussian means is suboptimal, and TokenGS drops it.
Experiments report stronger robustness to pose noise and multiview inconsistencies, with state-of-the-art feed-forward reconstruction on both static and dynamic scenes.
TokenGS is claimed to enable efficient test-time optimization in token space and to better recover higher-level scene attributes such as static-dynamic decomposition and scene flow.
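
To make the first two points concrete, here is a minimal NumPy sketch of the two parameterizations. It is an illustration under stated assumptions, not the authors' code: the function names `means_from_ray_depth` and `means_from_tokens`, the array shapes, and the random inputs are all hypothetical. In the pixel-aligned case, each mean is a camera origin plus a predicted depth times a ray direction, so the primitive count is fixed at views × height × width. In the direct case, the count is whatever number of tokens is chosen.

```python
import numpy as np

def means_from_ray_depth(origins, directions, depths):
    """Pixel-aligned parameterization: one Gaussian mean per pixel ray.

    origins, directions: (V, H, W, 3); depths: (V, H, W).
    Each mean is constrained to lie on its pixel's camera ray.
    """
    means = origins + depths[..., None] * directions
    return means.reshape(-1, 3)

def means_from_tokens(token_outputs):
    """Direct parameterization: each token output is already an xyz mean.

    token_outputs: (N, 3), with N chosen independently of image size.
    """
    return np.asarray(token_outputs, dtype=float).reshape(-1, 3)

rng = np.random.default_rng(0)
V, H, W, N = 2, 4, 6, 16  # hypothetical sizes for illustration

origins = np.zeros((V, H, W, 3))
directions = rng.normal(size=(V, H, W, 3))
directions /= np.linalg.norm(directions, axis=-1, keepdims=True)
depths = rng.uniform(1.0, 5.0, size=(V, H, W))

pixel_means = means_from_ray_depth(origins, directions, depths)
token_means = means_from_tokens(rng.normal(size=(N, 3)))

print(pixel_means.shape)  # (48, 3): tied to V * H * W
print(token_means.shape)  # (16, 3): tied only to N
```

Doubling the image resolution would quadruple the first count and leave the second unchanged, which is the decoupling the key points describe.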

Abstract

In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
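
The encoder-decoder idea in the abstract can be sketched as a single cross-attention step in which a fixed set of tokens queries the encoded image features. This is a rough sketch under assumptions, not the TokenGS architecture: the single-head attention, the random stand-in encoder `encode`, the weight names, and the 14-value Gaussian layout (3 mean, 3 scale, 4 rotation, 1 opacity, 3 color) are all hypothetical choices made for illustration. In a real model the tokens and weights would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_gaussians(tokens, image_features, w_q, w_k, w_v, w_out):
    """One single-head cross-attention step: tokens query image features.

    tokens: (N, d); image_features: (M, d), where M depends on the input.
    Returns (N, 14) Gaussian parameters, so the output size follows N, not M.
    """
    q = tokens @ w_q
    k = image_features @ w_k
    v = image_features @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    updated = tokens + attn @ v  # residual update of each token
    return updated @ w_out

rng = np.random.default_rng(0)
d, N = 32, 64  # hypothetical feature width and token count

tokens = rng.normal(size=(N, d))  # stand-in for learned Gaussian tokens
w_q, w_k, w_v = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
w_out = rng.normal(size=(d, 14)) / np.sqrt(d)

def encode(num_views, height, width):
    """Stand-in encoder: one random feature vector per pixel per view."""
    return rng.normal(size=(num_views * height * width, d))

small = decode_gaussians(tokens, encode(2, 8, 8), w_q, w_k, w_v, w_out)
large = decode_gaussians(tokens, encode(4, 16, 16), w_q, w_k, w_v, w_out)

print(small.shape, large.shape)  # both (64, 14)
```

The two calls see 128 and 1024 feature vectors respectively, yet both return 64 Gaussians. Because the primitives are a function of the tokens, one could also imagine refining the tokens by gradient descent at test time, which is the kind of token-space optimization the abstract mentions.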