Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting

arXiv cs.CV / 4/13/2026


Key Points

  • The paper targets a key limitation in recent 3D scene understanding methods that use 2D masks from visual foundation models: the supervision is not inherently object-centric and can require extra processing or specialized training to avoid identity conflicts across views.
  • It proposes a dataset-level, scene-agnostic object-centric supervision scheme for 3D Gaussian Splatting (3DGS) that learns consistent object identity representations across both views and different scenes.
  • The approach builds on a pre-trained slot-attention-based Global Object Centric Learning (GOCL) module and introduces a scene-agnostic object codebook that anchors object identity features, allowing the identity features of 3D Gaussians to be supervised directly.
  • By coupling the codebook with unsupervised object masks from the module, the method aims to remove the need for additional mask pre-/post-processing or explicit multi-view alignment, and avoids per-scene fine-tuning or retraining.
  • The authors position the resulting unsupervised object-centric learning (OCL) in 3DGS as producing more structured representations with improved generalization for downstream tasks such as robotic interaction and scene understanding.

Abstract

Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre-/post-processing or specialized training and loss design to resolve mask identity conflicts across views. As a result, the learned identities are scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot-attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module's unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations, better generalization to downstream tasks such as robotic interaction and scene understanding, and stronger cross-scene generalization.
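To make the coupling of codebook and masks concrete, here is a minimal NumPy sketch of how codebook-anchored identity supervision could look. This is an illustration under assumed shapes, not the paper's implementation: the function name, the mask-pooling step, and the cosine-similarity assignment to codebook entries are all assumptions; the paper does not specify these details in this summary.

```python
import numpy as np

def identity_supervision_loss(feats, masks, codebook):
    """Hypothetical sketch of codebook-anchored identity supervision.

    feats:    (H, W, D) per-pixel identity features rendered from 3D Gaussians
    masks:    (K, H, W) soft, unsupervised object masks (e.g. from a slot module)
    codebook: (C, D) scene-agnostic object identity vectors
    Returns a scalar loss and the per-mask codebook assignments.
    """
    H, W, D = feats.shape
    flat = feats.reshape(-1, D)                       # (HW, D)
    m = masks.reshape(masks.shape[0], -1)             # (K, HW)

    # Mask-pool rendered features into one descriptor per object.
    pooled = (m @ flat) / (m.sum(axis=1, keepdims=True) + 1e-8)   # (K, D)

    # Assign each pooled descriptor to its nearest codebook entry (cosine).
    pn = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8)
    cn = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + 1e-8)
    assign = np.argmax(pn @ cn.T, axis=1)             # (K,)

    # Pull every pixel's feature toward its object's codebook vector,
    # weighted by the mask, so identities stay anchored across views.
    target = codebook[assign]                         # (K, D)
    diff = flat[None, :, :] - target[:, None, :]      # (K, HW, D)
    loss = np.sum(m[..., None] * diff ** 2) / (m.sum() * D + 1e-8)
    return loss, assign
```

Because the codebook entries are fixed per dataset rather than per scene, the same assignment step can identify objects in a new scene without retraining, which is the property the abstract emphasizes.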