VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection

arXiv cs.CV / 5/5/2026

Key Points

  • The paper tackles open-vocabulary 3D affordance detection: localizing interaction regions on point clouds given novel affordance descriptions.
  • It argues that prior multimodal LLM approaches relying on autoregressive special output tokens capture semantics well but fail to model spatial neighborhood relationships for precise localization.
  • VoxAfford addresses this by injecting multi-scale geometric features from a frozen 3D VQ-VAE encoder into the generated tokens via cross-attention, with a learned gate controlling how much geometry is injected (a toy sketch of this gated fusion follows the list).
  • The enhanced, spatially aware tokens are then aggregated into an affordance prompt via semantic-conditioned attention and propagated alongside per-point features to produce the final segmentation mask.
  • Experiments report state-of-the-art results, about an 8% mIoU improvement, and real-robot tests demonstrate zero-shot transfer to novel objects.
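
A minimal PyTorch sketch of the gated cross-attention fusion described above. The module and tensor names (`GatedVoxelFusion`, `voxel_feats`, the scale sizes in the toy usage) are illustrative assumptions rather than the authors' implementation: each output token queries voxel features from its paired scale, and a learned compatibility gate weights how much of the retrieved geometry is added back into the token.

```python
# Sketch only: names and shapes are assumptions, not the paper's released code.
import torch
import torch.nn as nn


class GatedVoxelFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Compatibility gate: scores how well the token semantics match the
        # retrieved geometry, producing an injection weight in (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, token: torch.Tensor, voxel_feats: torch.Tensor) -> torch.Tensor:
        # token:       (B, 1, D)  special output token from the MLLM
        # voxel_feats: (B, N, D)  features from one scale of the frozen 3D VQ-VAE encoder
        geom, _ = self.attn(query=token, key=voxel_feats, value=voxel_feats)
        g = self.gate(torch.cat([token, geom], dim=-1))   # (B, 1, 1)
        return token + g * geom                           # gated residual injection


# Toy usage: three output tokens, each paired with a different voxel scale.
if __name__ == "__main__":
    dim, fusion = 256, GatedVoxelFusion(256)
    tokens = [torch.randn(1, 1, dim) for _ in range(3)]
    scales = [torch.randn(1, n, dim) for n in (512, 128, 32)]  # multi-scale voxel features
    enhanced = [fusion(t, v) for t, v in zip(tokens, scales)]
```

The gated residual leaves the original token semantics intact when the retrieved geometry is a poor match (gate near zero), which matches the stated goal of adding spatial structure without overwriting what the MLLM has already encoded.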

Abstract

Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQ-VAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate controlling the injection strength. The enhanced tokens are then aggregated into a spatially-aware affordance prompt through semantic-conditioned attention and propagated alongside per-point features to generate the final mask. Experiments on open-vocabulary affordance detection tasks show that VoxAfford achieves state-of-the-art performance with approximately an 8% improvement in mIoU, and real robot experiments confirm zero-shot transfer to novel objects.
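
The final stages, aggregating the enhanced tokens into an affordance prompt and propagating it to per-point mask logits, could look roughly like the speculative sketch below. `AffordancePromptDecoder`, `sem_query`, and the dot-product scoring are illustrative assumptions; the abstract only states that semantic-conditioned attention aggregates the tokens and that the resulting prompt is propagated alongside per-point features.

```python
# Speculative sketch under the assumptions stated above; not the authors' code.
import torch
import torch.nn as nn


class AffordancePromptDecoder(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.point_proj = nn.Linear(dim, dim)

    def forward(self, sem_query, enhanced_tokens, point_feats):
        # sem_query:       (B, 1, D)  affordance-text embedding conditioning the aggregation
        # enhanced_tokens: (B, T, D)  spatially aware tokens after voxel fusion
        # point_feats:     (B, P, D)  per-point features of the input point cloud
        prompt, _ = self.aggregate(sem_query, enhanced_tokens, enhanced_tokens)  # (B, 1, D)
        # Propagate: similarity between the prompt and each projected point feature.
        logits = (self.point_proj(point_feats) @ prompt.transpose(1, 2)).squeeze(-1)  # (B, P)
        return logits  # sigmoid + threshold yields the final affordance mask


# Toy usage with 2048 points and three enhanced tokens.
if __name__ == "__main__":
    dec = AffordancePromptDecoder(256)
    logits = dec(torch.randn(2, 1, 256), torch.randn(2, 3, 256), torch.randn(2, 2048, 256))
    mask = logits.sigmoid() > 0.5   # (2, 2048) boolean per-point mask
```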