FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation

arXiv cs.CV, March 30, 2026


Key Points

  • The paper introduces FAST3DIS, an end-to-end feed-forward Transformer approach for 3D instance segmentation that avoids the common “lift-and-cluster” pipeline used by many prior feed-forward 3D reconstruction methods.
  • FAST3DIS uses a 3D-anchored, query-based Transformer with a learned 3D anchor generator and anchor-sampling cross-attention to project object queries into multi-view feature maps for efficient, view-consistent instance prediction.
  • The method retains zero-shot geometric priors from a depth backbone while adapting to learn instance-specific semantics directly rather than relying on non-differentiable clustering.
  • It adds dual-level regularization combining multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to prevent query collisions and improve boundary precision.
  • Experiments on complex indoor 3D datasets show competitive segmentation accuracy with improved memory scalability and faster inference than clustering-based state-of-the-art methods.
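The anchor-sampling cross-attention summarized above hinges on one geometric step: projecting each learned 3D anchor into every view's feature map and gathering features there. A minimal pure-Python sketch of that step, using a standard pinhole camera model; the function names, nearest-neighbour sampling, and all numbers are illustrative assumptions, not the paper's actual implementation:

```python
def project_anchor(anchor, K, R, t):
    """Project a 3D anchor (world coordinates) into one view's pixel plane.

    K: 3x3 intrinsics, R: 3x3 rotation, t: length-3 translation (world -> camera).
    Returns (u, v) pixel coordinates, or None if the anchor is behind the camera.
    """
    # World -> camera coordinates: Xc = R @ X + t
    Xc = [sum(R[i][j] * anchor[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 1e-6:
        return None  # anchor not visible in this view
    # Pinhole projection: u = fx * x/z + cx, v = fy * y/z + cy
    u = K[0][0] * Xc[0] / Xc[2] + K[0][2]
    v = K[1][1] * Xc[1] / Xc[2] + K[1][2]
    return (u, v)

def sample_feature(feat, uv):
    """Nearest-neighbour lookup in an HxW feature grid. A differentiable
    pipeline would use bilinear sampling; nearest keeps the sketch short."""
    if uv is None:
        return None
    u, v = uv
    r, c = round(v), round(u)
    if 0 <= r < len(feat) and 0 <= c < len(feat[0]):
        return feat[r][c]
    return None  # projected outside this view's feature map

# One anchor projected into a single hypothetical view; in the actual method,
# samples gathered across all visible views would form the query's context.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
uv = project_anchor([0.0, 0.0, 2.0], K, R, t)
print(uv)  # anchor on the optical axis lands at the principal point (50.0, 50.0)
```

Because each query only samples features at a handful of projected locations per view, the cost grows with the number of queries rather than with dense per-pixel embeddings, which is the memory-scalability argument of the key points.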

Abstract

While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed "lift-and-cluster" paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that bypasses post-hoc clustering entirely. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism for view-consistent 3D instance segmentation. By projecting 3D object queries directly into multi-view feature maps, our method samples context efficiently. Furthermore, we introduce a dual-level regularization strategy that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state-of-the-art clustering-based methods.
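The "spatial overlap penalty" half of the dual-level regularization can be pictured concretely: two queries are penalized whenever their soft instance masks both claim the same pixels, with the penalty's weight ramped up over training. A toy sketch under those assumptions; the pairwise-product form of the penalty and the linear warm-up schedule are illustrative guesses, not the paper's actual loss:

```python
def overlap_penalty(masks):
    """Sum of pixel-wise products over all query pairs.

    masks: per-query soft masks (probabilities over the same flattened pixels).
    The penalty is zero iff no two masks assign mass to the same pixel.
    """
    total = 0.0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            total += sum(a * b for a, b in zip(masks[i], masks[j]))
    return total

def scheduled_weight(step, total_steps, w_max=1.0):
    """Linearly ramp the penalty weight over the first half of training;
    the exact 'dynamic schedule' here is an assumption for illustration."""
    return w_max * min(1.0, step / (0.5 * total_steps))

# Two queries fighting over pixel 0 are penalized; disjoint masks are not.
colliding = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]
disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(overlap_penalty(colliding))  # 0.72
print(overlap_penalty(disjoint))   # 0.0
```

Scheduling the weight from near zero avoids punishing queries before they have specialized, then tightens boundaries once assignments stabilize, which matches the abstract's claim that the penalty "explicitly prevents query collisions."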