Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

arXiv cs.CV / 4/9/2026


Key Points

  • The paper argues that multimodal LLMs still struggle with 3D spatial reasoning because they mainly rely on 2D visual priors rather than explicit 3D geometry and flexible viewpoints.
  • It proposes a training-free pipeline that uses an MLLM-guided “Visual Chain-of-Thought” to reconstruct a high-fidelity 3D mesh from a single image via multi-granularity keyword extraction and mask generation.
  • The method then uses an external knowledge base to iteratively estimate optimal camera extrinsic parameters and generate novel views, aiming to emulate human perspective-taking for multi-perspective reasoning.
  • Experiments report substantial improvements on benchmarks like 3DSRBench and Rel3D, outperforming both specialized spatial models and general-purpose MLLMs including GPT-5.2 and Gemini-2.5-Flash.
  • The approach avoids expensive post-training on limited 3D datasets by grounding reasoning in explicit 3D reconstruction and dynamic viewpoint synthesis rather than fixed tool-calling.
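The "camera extrinsic parameters" in the points above are the rotation and translation that map world coordinates into a virtual camera's frame. The paper's code is not reproduced here, but the standard building block for placing such a virtual camera is a look-at construction, sketched below (hypothetical helper name `look_at_extrinsics`; assumes a -z-forward camera convention and NumPy only):

```python
import numpy as np

def look_at_extrinsics(camera_pos, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 world-to-camera extrinsic matrix that orients a camera
    at `camera_pos` toward `target` (-z forward, y up convention)."""
    forward = target - camera_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    # Rows of R are the camera axes expressed in world coordinates.
    R = np.stack([right, true_up, -forward])
    t = -R @ camera_pos
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = t
    return E

# A camera 5 units along +z looking at the origin sees the origin
# 5 units in front of it (negative z under this convention).
E = look_at_extrinsics(np.array([0.0, 0.0, 5.0]), np.zeros(3))
print(E @ np.array([0.0, 0.0, 0.0, 1.0]))  # → [ 0.  0. -5.  1.]
```

Given a reconstructed mesh, a renderer fed with such an extrinsic matrix (plus intrinsics) produces the novel view the MLLM then reasons over.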

Abstract

Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to their reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a *training-free* framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including *GPT-5.2* and *Gemini-2.5-Flash*, on major benchmarks such as 3DSRBench and Rel3D.
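The iterative search over viewpoints described in the abstract presupposes a set of candidate camera placements to evaluate. A common, simple way to enumerate such candidates is to sample positions on a view sphere around the scene centre; the sketch below (hypothetical helper `sample_view_sphere`, not from the paper) shows this enumeration, after which each candidate would be rendered and scored:

```python
import numpy as np

def sample_view_sphere(radius, n_azimuth=8, elevations_deg=(0.0, 30.0, 60.0)):
    """Enumerate candidate camera positions on a sphere of given radius
    around the origin: n_azimuth headings at each elevation band."""
    positions = []
    for el in np.radians(elevations_deg):
        for az in np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False):
            positions.append(radius * np.array([
                np.cos(el) * np.cos(az),  # x
                np.sin(el),               # y (height)
                np.cos(el) * np.sin(az),  # z
            ]))
    return np.array(positions)

candidates = sample_view_sphere(radius=2.0)
print(candidates.shape)  # → (24, 3): 8 azimuths x 3 elevations
```

In a full perspective-taking loop, each candidate position would be turned into an extrinsic matrix, rendered against the reconstructed mesh, and the resulting view fed back to the MLLM to decide whether spatial relations are now unambiguous.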