MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

arXiv cs.CV / 4/13/2026


Key Points

  • The paper introduces MAG-3D, a training-free multi-agent framework aimed at improving grounded reasoning in 3D scenes using off-the-shelf vision-language models (VLMs).
  • MAG-3D uses three coordinated expert agents—planning, grounding, and coding—to decompose tasks, identify query-relevant 3D regions/objects, and perform geometric reasoning with explicit verification.
  • The grounding agent performs free-form 3D grounding and retrieves relevant frames from large 3D scene observations to support open-ended queries.
  • The coding agent executes generated programs to verify geometric reasoning steps, addressing reliability issues common in fixed or hand-crafted pipelines.
  • The authors report state-of-the-art results on challenging 3D grounded reasoning benchmarks and emphasize improved flexibility and zero-shot generalization to novel environments compared with in-domain-tuned methods.
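The three-agent division of labor above can be sketched as a small orchestration loop. This is an illustrative mock, not the paper's implementation: the class names, the `Scene` representation, and the query interface are all assumptions, and the VLM calls are replaced with deterministic stand-ins so the control flow is visible.

```python
# Hypothetical sketch of a MAG-3D-style multi-agent loop. All names and
# interfaces here are illustrative assumptions; the real system delegates
# grounding and code generation to off-the-shelf VLMs.
from dataclasses import dataclass


@dataclass
class Scene:
    # Toy stand-in for 3D observations: object name -> (x, y, z) centroid.
    objects: dict


class GroundingAgent:
    """Finds query-relevant objects in the scene (stand-in for free-form
    3D grounding and relevant-frame retrieval with a VLM)."""

    def ground(self, scene, names):
        return {n: scene.objects[n] for n in names if n in scene.objects}


class CodingAgent:
    """Verifies a geometric claim by executing a small program
    (stand-in for VLM-generated code plus execution)."""

    def check_distance(self, p, q, threshold):
        dist = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        return dist, dist < threshold


class PlanningAgent:
    """Decomposes the query and orchestrates the other agents."""

    def __init__(self):
        self.grounder = GroundingAgent()
        self.coder = CodingAgent()

    def answer_near_query(self, scene, obj_a, obj_b, threshold=1.0):
        # Step 1: ground the two query-relevant objects.
        grounded = self.grounder.ground(scene, [obj_a, obj_b])
        if len(grounded) < 2:
            return "cannot ground both objects"
        # Step 2: verify the spatial relation with executable code.
        dist, is_near = self.coder.check_distance(
            grounded[obj_a], grounded[obj_b], threshold)
        return f"{'yes' if is_near else 'no'} (distance {dist:.2f} m)"


scene = Scene(objects={"chair": (0.0, 0.0, 0.0), "table": (0.5, 0.0, 0.0)})
print(PlanningAgent().answer_near_query(scene, "chair", "table"))
```

The point of the sketch is the separation of concerns: the planner never touches geometry directly, grounding returns only query-relevant regions, and the spatial relation is settled by executed code rather than by a free-text guess, mirroring the verification role the paper assigns to its coding agent.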

Abstract

Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.