SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

arXiv cs.CV / April 23, 2026


Key Points

  • The paper introduces SpaCeFormer, a proposal-free approach to open-vocabulary 3D instance segmentation designed for robotics and AR/VR use cases.
  • SpaCeFormer runs at 0.14 seconds per scene, addressing the latency bottleneck of slower multi-stage 2D+3D pipelines that can take hundreds of seconds per scene.
  • The authors also release SpaCeFormer-3M, a large open-vocabulary 3D instance segmentation dataset with 3.0M multi-view-consistent captions covering 604K instances across 7.4K scenes, constructed via multi-view mask clustering and VLM captioning.
  • The method uses spatial window attention plus Morton-curve serialization for coherent 3D features, and a RoPE-enhanced decoder that predicts instance masks directly from learned queries without external region proposals.
  • Experiments show strong improvements, including 11.1 zero-shot mAP on ScanNet200 (2.8x over the prior best proposal-free method) and 22.9/24.1 mAP on ScanNet++ and Replica, surpassing prior methods, even those using multi-view 2D inputs.
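The Morton-curve serialization mentioned above is a standard trick for turning a 3D point cloud into a spatially coherent 1D sequence: interleaving the bits of each voxel's x/y/z coordinates yields a Z-order code, and sorting by that code keeps nearby voxels close together in the serialized order. A minimal sketch (generic bit-interleaving, not the authors' code):

```python
def part1by2(n):
    # Spread the bits of a 21-bit integer so two zero bits sit between
    # each original bit (the classic 3D bit-interleaving magic numbers).
    n &= 0x1FFFFF
    n = (n | (n << 32)) & 0x1F00000000FFFF
    n = (n | (n << 16)) & 0x1F0000FF0000FF
    n = (n | (n << 8))  & 0x100F00F00F00F00F
    n = (n | (n << 4))  & 0x10C30C30C30C30C3
    n = (n | (n << 2))  & 0x1249249249249249
    return n

def morton_encode(x, y, z):
    # Interleave voxel coordinates into a single Morton (Z-order) code.
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Sorting voxels by Morton code serializes the 3D grid into a 1D
# sequence in which spatially close voxels tend to stay adjacent,
# which is what makes windowed attention over the sequence coherent.
voxels = [(0, 0, 0), (7, 7, 7), (1, 0, 0), (0, 1, 1)]
serialized = sorted(voxels, key=lambda v: morton_encode(*v))
```

How SpaCeFormer combines this ordering with its spatial window attention is detailed in the paper; the sketch only shows the serialization step itself.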

Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
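The RoPE-enhanced decoder builds on rotary position embeddings, which rotate each pair of query/key channels by a position-dependent angle so that attention scores depend only on relative offsets. The paper presumably adapts this to 3D instance queries; the following is a generic 1D sketch for intuition, not the authors' implementation:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (n, d) query or key vectors with d even; positions: (n,) positions.
    # Each consecutive channel pair is rotated by an angle proportional to
    # its position, so q·k after rotation depends only on relative offset.
    n, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) rotation frequencies
    ang = positions[:, None] * inv_freq[None, :]   # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Rotations preserve vector norms, and shifting both positions by the
# same amount leaves the query-key dot product unchanged.
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))
score_a = rope(q, np.array([3])) @ rope(k, np.array([5])).T
score_b = rope(q, np.array([10])) @ rope(k, np.array([12])).T
```

Here `score_a` and `score_b` match because the relative offset (2) is the same in both cases; that relative-position property is what makes RoPE attractive for decoders that attend from learned queries into large scenes.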