FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation

arXiv cs.CV, March 30, 2026


Key Points

  • The paper introduces FAST3DIS, an end-to-end feed-forward Transformer approach for 3D instance segmentation that avoids the common “lift-and-cluster” pipeline used by many prior feed-forward 3D reconstruction methods.
  • FAST3DIS uses a 3D-anchored, query-based Transformer with a learned 3D anchor generator and anchor-sampling cross-attention to project object queries into multi-view feature maps for efficient, view-consistent instance prediction.
  • The method retains zero-shot geometric priors from a depth backbone while adapting to learn instance-specific semantics directly rather than relying on non-differentiable clustering.
  • It adds dual-level regularization combining multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to prevent query collisions and improve boundary precision.
  • Experiments on complex indoor 3D datasets show competitive segmentation accuracy with improved memory scalability and faster inference than clustering-based state-of-the-art methods.
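The anchor-sampling cross-attention summarized above hinges on one geometric step: projecting each learned 3D anchor into every view's feature map and gathering features there. A minimal pure-Python sketch of that step, using a standard pinhole camera model; the function names, nearest-neighbour sampling, and all numbers are illustrative assumptions, not the paper's actual implementation:

```python
def project_anchor(anchor, K, R, t):
    """Project a 3D anchor (world coordinates) into one view's pixel plane.

    K: 3x3 intrinsics, R: 3x3 rotation, t: length-3 translation (world -> camera).
    Returns (u, v) pixel coordinates, or None if the anchor is behind the camera.
    """
    # World -> camera coordinates: Xc = R @ X + t
    Xc = [sum(R[i][j] * anchor[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 1e-6:
        return None  # anchor not visible in this view
    # Pinhole projection: u = fx * x/z + cx, v = fy * y/z + cy
    u = K[0][0] * Xc[0] / Xc[2] + K[0][2]
    v = K[1][1] * Xc[1] / Xc[2] + K[1][2]
    return (u, v)

def sample_feature(feat, uv):
    """Nearest-neighbour lookup in an HxW feature grid. A differentiable
    pipeline would use bilinear sampling; nearest keeps the sketch short."""
    if uv is None:
        return None
    u, v = uv
    r, c = round(v), round(u)
    if 0 <= r < len(feat) and 0 <= c < len(feat[0]):
        return feat[r][c]
    return None  # projected outside this view's feature map

# One anchor projected into a single hypothetical view; in the actual method,
# samples gathered across all visible views would form the query's context.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
uv = project_anchor([0.0, 0.0, 2.0], K, R, t)
print(uv)  # anchor on the optical axis lands at the principal point (50.0, 50.0)
```

Because each query only samples features at a handful of projected locations per view, the cost grows with the number of queries rather than with dense per-pixel embeddings, which is the memory-scalability argument of the key points.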

Abstract

While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed "lift-and-cluster" paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that bypasses post-hoc clustering entirely. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism for view-consistent 3D instance segmentation. By projecting 3D object queries directly into multi-view feature maps, our method samples context efficiently. Furthermore, we introduce a dual-level regularization strategy that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state-of-the-art clustering-based methods.
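The "spatial overlap penalty" half of the dual-level regularization can be pictured concretely: two queries are penalized whenever their soft instance masks both claim the same pixels, with the penalty's weight ramped up over training. A toy sketch under those assumptions; the pairwise-product form of the penalty and the linear warm-up schedule are illustrative guesses, not the paper's actual loss:

```python
def overlap_penalty(masks):
    """Sum of pixel-wise products over all query pairs.

    masks: per-query soft masks (probabilities over the same flattened pixels).
    The penalty is zero iff no two masks assign mass to the same pixel.
    """
    total = 0.0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            total += sum(a * b for a, b in zip(masks[i], masks[j]))
    return total

def scheduled_weight(step, total_steps, w_max=1.0):
    """Linearly ramp the penalty weight over the first half of training;
    the exact 'dynamic schedule' here is an assumption for illustration."""
    return w_max * min(1.0, step / (0.5 * total_steps))

# Two queries fighting over pixel 0 are penalized; disjoint masks are not.
colliding = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]
disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(overlap_penalty(colliding))  # 0.72
print(overlap_penalty(disjoint))   # 0.0
```

Scheduling the weight from near zero avoids punishing queries before they have specialized, then tightens boundaries once assignments stabilize, which matches the abstract's claim that the penalty "explicitly prevents query collisions."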