EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

arXiv cs.CV / 4/7/2026


Key Points

  • The paper observes that existing approaches to improving spatial reasoning tend to rely on 3D priors or geometric supervision, which incurs high data costs, and proposes EgoMind, which targets spatial reasoning without any geometric prior knowledge.
  • EgoMind is a Chain-of-Thought framework: Role-Play Caption constructs a linguistic scene graph that stays consistent across frames, and Progressive Spatial Analysis reasons step by step toward the task-specific question.
  • Whereas 2D-only approaches struggle to capture spatial relationships across multiple frames, this design handles cross-frame relations through linguistic reasoning.
  • With relatively small-scale training of only 5K auto-generated SFT samples and 20K RL samples, the authors report competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench.
  • Code and data are publicly released, making this an early research signal for the effectiveness of linguistic reasoning in strengthening the spatial cognition abilities of MLLMs.
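The two-stage pipeline described above can be pictured as two prompt-building steps feeding an MLLM. The sketch below is a minimal illustration assuming a generic text-prompted model; the stage names come from the paper, but the prompt wording, function names, and scene-graph format are hypothetical and not EgoMind's actual implementation.

```python
def role_play_caption_prompt(num_frames: int) -> str:
    """Stage 1 (Role-Play Caption): ask the model to narrate the frames in
    first person and merge the observations into one scene graph that is
    consistent across all frames. Prompt wording is illustrative only."""
    return (
        f"You are the camera wearer moving through this scene ({num_frames} frames).\n"
        "Describe, in first person, each object you see and its position relative\n"
        "to you. Then merge your observations into a single scene graph of\n"
        "(object, relation, object) triples that is consistent across frames."
    )

def progressive_spatial_analysis_prompt(scene_graph: str, question: str) -> str:
    """Stage 2 (Progressive Spatial Analysis): reason step by step over the
    linguistic scene graph toward the task-specific question."""
    return (
        f"Scene graph:\n{scene_graph}\n\n"
        f"Question: {question}\n"
        "Reason step by step over the scene graph: first locate the relevant\n"
        "objects, then trace their spatial relations, and only then answer."
    )

# Usage: the stage-1 output (a linguistic scene graph) becomes the input
# context for stage 2, so no 3D priors or geometry are ever required.
stage1 = role_play_caption_prompt(num_frames=8)
stage2 = progressive_spatial_analysis_prompt(
    scene_graph="(chair, left-of, table)\n(lamp, behind, chair)",
    question="Is the lamp closer to the chair or the table?",
)
```

The point of the sketch is the data flow: all spatial structure lives in language (the triples), so cross-frame consistency is enforced by the stage-1 instruction rather than by geometric supervision.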

Abstract

Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.