Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding
arXiv cs.CV / 4/30/2026
Key Points
- The paper introduces MCM-VG, a framework for robust zero-shot 3D visual grounding that addresses failures caused by low-quality open-vocabulary 3D proposals.
- MCM-VG improves reliability by enforcing multiple consistent 2D-3D mappings using three components: semantic alignment (LLM-driven query parsing and coarse-to-fine matching), instance rectification (VLM-guided 2D segmentations for reconstructing missing targets and accurate 3D geometry), and viewpoint distillation (clustering camera directions to reduce redundant multi-view reasoning).
- The method formulates final target disambiguation as a multiple-choice reasoning task for vision-language models by pairing selected RGB frames with bird’s-eye-view maps as compact visual prompts.
- Experiments on ScanRefer and Nr3D show state-of-the-art performance, achieving 62.0% Acc@0.25 and 53.6% Acc@0.5 on ScanRefer, with gains of 6.4% and 4.0% over prior baselines.
- Overall, the work advances open-world embodied AI by enabling more precise and dependable zero-shot localization and reasoning in 3D environments.
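The viewpoint-distillation step described above (clustering camera directions so the VLM reasons over a few representative views instead of every frame) can be sketched roughly as follows. This is an illustrative approximation, not the paper's implementation: the clustering algorithm (a simple spherical k-means), the number of clusters `k`, and the cosine-similarity representative selection are all assumptions made here.

```python
import numpy as np

def distill_viewpoints(camera_dirs, k=3, iters=20, seed=0):
    """Cluster unit camera-direction vectors with a simple spherical
    k-means and return one representative frame index per cluster
    (the frame whose direction is most similar to the centroid).

    Sketch only: the paper's actual clustering method, cluster count,
    and selection criterion are not specified here and are assumed.
    """
    dirs = np.asarray(camera_dirs, dtype=float)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # normalize to unit vectors
    rng = np.random.default_rng(seed)
    centroids = dirs[rng.choice(len(dirs), size=k, replace=False)]
    for _ in range(iters):
        # assign each frame to the centroid with highest cosine similarity
        labels = np.argmax(dirs @ centroids.T, axis=1)
        for c in range(k):
            members = dirs[labels == c]
            if len(members):
                mean_dir = members.mean(axis=0)
                centroids[c] = mean_dir / np.linalg.norm(mean_dir)
    # pick one representative frame per non-empty cluster
    reps = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if len(idx):
            sims = dirs[idx] @ centroids[c]
            reps.append(int(idx[np.argmax(sims)]))
    return sorted(set(reps))

# Hypothetical usage: five camera directions roughly spanning three axes
dirs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [0, 0, 1]]
reps = distill_viewpoints(dirs, k=3)
```

The returned indices would then select the RGB frames that, paired with a bird's-eye-view map, form the compact multiple-choice visual prompt for the vision-language model.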