Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
arXiv cs.AI / 2026-03-24
Key Points
- The paper argues that multimodal large language models (MLLMs) deployed in embodied agents often lack genuine spatial reasoning over long temporal and spatial horizons, relying instead on reactive planning from immediate observations.
- It introduces Video2Mental, a new benchmark that tests “mental navigation”: models must construct hierarchical cognitive maps from long egocentric videos and plan landmark-based paths, with plans verified through physical interaction in a simulator (see the first sketch after this list).
- Benchmark results show that mental-navigation ability does not emerge from standard pre-training: zero-shot structured spatial representations are poor, and planning accuracy degrades sharply as horizons grow.
- To address this, the authors propose NavMind, a reasoning model that treats explicit fine-grained cognitive maps as learnable intermediate representations and is trained with difficulty-stratified progressive supervised fine-tuning (see the second sketch after this list).
- Experiments indicate that NavMind substantially outperforms frontier commercial and other spatial MLLMs at mental navigation within the proposed evaluation framework.
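
The digest doesn't specify the paper's cognitive-map format or planner, so the following is a minimal illustrative sketch, assuming the map is a plain adjacency graph of named landmarks and the planner is breadth-first search; all landmark names and the `plan_path` function are hypothetical, not the authors' method.

```python
from collections import deque

# Hypothetical cognitive map: an adjacency graph of named landmarks
# recovered from an egocentric video (names are illustrative only).
cognitive_map = {
    "kitchen":     ["hallway"],
    "hallway":     ["kitchen", "bedroom", "living_room"],
    "bedroom":     ["hallway"],
    "living_room": ["hallway"],
}

def plan_path(graph, start, goal):
    """Landmark-based planning as breadth-first search: returns the
    shortest landmark sequence from start to goal, or None."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None  # goal not reachable in the map

print(plan_path(cognitive_map, "kitchen", "bedroom"))
# -> ['kitchen', 'hallway', 'bedroom']
```

The benchmark's premise is that planning over such an explicit structure, rather than over raw frames, is what long-horizon navigation requires.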

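“Difficulty-stratified progressive supervised fine-tuning” can likewise be pictured as a curriculum: bucket training samples by a difficulty signal (for example, planning-horizon length) and fine-tune on progressively harder strata. The skeleton below is an assumption, not the paper's recipe; `difficulty` and `fine_tune` are hypothetical stand-ins.

```python
def progressive_sft(model, dataset, difficulty, fine_tune,
                    stages=("easy", "medium", "hard")):
    """Curriculum-style SFT sketch: `difficulty(sample)` maps each sample
    to a stage label; `fine_tune(model, samples)` returns updated weights.
    Both callables are hypothetical placeholders."""
    # Stratify the dataset by difficulty.
    buckets = {stage: [] for stage in stages}
    for sample in dataset:
        buckets[difficulty(sample)].append(sample)
    # Train on progressively harder strata, carrying weights forward.
    for stage in stages:
        model = fine_tune(model, buckets[stage])
    return model
```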