Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

arXiv cs.CV / 4/2/2026

Key Points

  • The paper proposes an agentic “Think, Act, Build (TAB)” framework for zero-shot 3D visual grounding that avoids the common static pipeline of matching against preprocessed 3D point-cloud proposals.
  • TAB decouples the problem by using a vision-language model agent to interpret spatial semantics in 2D while using deterministic multi-view geometry to reconstruct and instantiate the 3D structure directly from raw RGB-D streams.
  • To address multi-view coverage gaps from semantic-only 2D tracking, the authors introduce “Semantic-Anchored Geometric Expansion,” which anchors the target in a reference clip and propagates its 3D location to unobserved frames via geometric camera reasoning.
  • The work also critiques benchmark evaluation—highlighting issues like reference ambiguity and category errors—and improves test queries to enable more rigorous assessment.
  • Experiments on ScanRefer and Nr3D report substantial gains over prior zero-shot methods and results that can exceed fully supervised baselines, using only open-source components.
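The "deterministic multi-view geometry" leg of the decoupling is standard pinhole back-projection: once the VLM agent has tracked the target as a 2D mask in an RGB-D frame, the mask's depth pixels can be lifted into world coordinates via the camera intrinsics and pose. The sketch below is illustrative, not the paper's actual implementation; the function name, argument conventions, and matrix layouts are assumptions.

```python
import numpy as np

def backproject_mask(depth, mask, K, T_wc):
    """Lift masked depth pixels into world coordinates (illustrative sketch).

    depth : (H, W) depth map in meters
    mask  : (H, W) boolean target mask from the 2D tracker
    K     : (3, 3) camera intrinsics
    T_wc  : (4, 4) camera-to-world pose
    """
    v, u = np.nonzero(mask)
    z = depth[v, u]
    valid = z > 0                       # drop invalid depth readings
    u, v, z = u[valid], v[valid], z[valid]
    # Pinhole back-projection: pixel (u, v) with depth z -> camera-frame point
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous
    pts_world = (T_wc @ pts_cam.T).T[:, :3]
    return pts_world
```

Aggregating such point sets across the views the agent visits, and taking a min/max over the union, yields the target's 3D bounding box without any preprocessed point-cloud proposals.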

Abstract

3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
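The "Semantic-Anchored Geometric Expansion" step rests on the converse operation: a target anchored in 3D from a reference clip can be projected into frames the semantic tracker never covered, using only their camera parameters, to predict where the target should appear. A minimal sketch of that forward projection follows; the function name and return convention are assumptions, not the paper's API.

```python
import numpy as np

def project_anchor(p_world, K, T_wc, img_hw):
    """Project an anchored 3D point into another view (illustrative sketch).

    p_world : (3,) anchored target location in world coordinates
    K       : (3, 3) camera intrinsics of the unobserved frame
    T_wc    : (4, 4) camera-to-world pose of that frame
    img_hw  : (height, width) of the frame

    Returns ((u, v), inside_frame), or None if the point is behind the camera.
    """
    T_cw = np.linalg.inv(T_wc)           # world -> camera
    p_cam = T_cw[:3, :3] @ p_world + T_cw[:3, 3]
    if p_cam[2] <= 0:                    # behind the camera: no visibility
        return None
    uv = K @ (p_cam / p_cam[2])          # perspective projection
    u, v = uv[0], uv[1]
    h, w = img_hw
    inside = (0 <= u < w) and (0 <= v < h)
    return (u, v), inside
```

Frames where the projected anchor lands inside the image bounds are candidates for harvesting additional multi-view features of the target, closing the coverage gap left by strict semantic tracking.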