Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
arXiv cs.CV / 4/2/2026
Key Points
- The paper proposes an agentic “Think, Act, Build (TAB)” framework for zero-shot 3D visual grounding that avoids the common static pipeline of matching against preprocessed 3D point-cloud proposals.
- TAB decouples the problem: a vision-language model agent interprets spatial semantics in 2D, while deterministic multi-view geometry reconstructs and instantiates the 3D structure directly from raw RGB-D streams.
- To address multi-view coverage gaps from semantic-only 2D tracking, the authors introduce “Semantic-Anchored Geometric Expansion,” which anchors the target in a reference clip and propagates its 3D location to unobserved frames via geometric camera reasoning.
- The work also critiques benchmark evaluation—highlighting issues like reference ambiguity and category errors—and improves test queries to enable more rigorous assessment.
- Experiments on ScanRefer and Nr3D report substantial gains over prior zero-shot methods and results that can exceed fully supervised baselines, using only open-source components.
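The geometric propagation described in the third point rests on standard pinhole-camera math: lift an anchored 2D detection to a 3D world point using depth and camera pose, then reproject that point into frames where the target was never tracked. A minimal sketch of that idea (function names and interfaces are illustrative, not from the paper):

```python
import numpy as np

def backproject(u, v, depth, K, T_wc):
    """Lift pixel (u, v) with metric depth to a 3D world point.

    K is the 3x3 intrinsic matrix; T_wc is the 4x4 camera-to-world pose.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole inverse projection into the camera frame (homogeneous coords).
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth,
                      1.0])
    return (T_wc @ p_cam)[:3]

def project(p_world, K, T_cw):
    """Project a 3D world point into another view's image plane.

    T_cw is that view's 4x4 world-to-camera pose.
    """
    p_cam = (T_cw @ np.append(p_world, 1.0))[:3]
    uv = K @ p_cam
    return uv[:2] / uv[2]  # perspective divide
```

With a target anchored in one reference frame, `backproject` gives its world location, and `project` predicts where it should appear in any other calibrated frame, which is the kind of camera reasoning the expansion step relies on to cover frames the semantic tracker missed.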