Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
arXiv cs.CV, March 31, 2026
Key Points
- The paper studies whether multimodal LLM agents can collaborate through dialogue to integrate partial, egocentric observations into a coherent allocentric (shared) spatial understanding.
- It introduces COSMIC, a benchmark with 899 indoor 3D scenes and 1250 QA pairs across five tasks where two static MLLM agents exchange natural-language messages to answer spatial queries.
- Results show a capability hierarchy: models are strongest at grounding shared anchor objects across views but weaker at relational reasoning, and they largely fail to construct globally consistent maps, performing near chance even for frontier systems.
- Enabling "thinking" modes improves the reliability of anchor grounding but does not meaningfully improve higher-level spatial communication or global consistency.
- A comparison with 250 human-human dialogues finds that humans reach far higher accuracy (95% vs. 72% for the best model, Gemini-3-Pro-Thinking) and converge on a shared mental model, whereas model dialogues tend to keep exploring rather than converging. Code and data are released on GitHub.
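The two-agent protocol described in the key points can be sketched as a simple message-passing loop. The agent logic below is a toy rule-based stand-in for the MLLMs in the paper, and all function names, message formats, and object sets are illustrative assumptions, not the benchmark's actual interface:

```python
# Hypothetical sketch: two static agents with partial views exchange
# text messages to ground a shared "anchor" object, mirroring the
# dialogue setup described above. Not the paper's implementation.

def make_agent(name, objects):
    """Each agent holds only its own egocentric object set (partial view)."""
    seen = set(objects)

    def respond(incoming):
        # If the partner already broadcast its view, answer with a shared anchor.
        if incoming.startswith("SEES"):
            shared = seen & set(incoming.split()[1:])
            if shared:
                return "ANSWER " + sorted(shared)[0]
        # Otherwise broadcast this agent's own partial view.
        return "SEES " + " ".join(sorted(seen))

    return respond

def run_dialogue(agent_a, agent_b, query="QUERY shared anchor?", max_turns=6):
    """Alternate messages between agents until one commits to an answer."""
    msg, speakers = query, [agent_a, agent_b]
    for turn in range(max_turns):
        msg = speakers[turn % 2](msg)
        if msg.startswith("ANSWER"):
            return msg.split(maxsplit=1)[1]
    return None  # dialogue failed to converge within the turn budget

a = make_agent("A", ["chair", "lamp", "table"])
b = make_agent("B", ["lamp", "sofa"])
print(run_dialogue(a, b))  # -> lamp
```

The explicit turn budget and `None` fallback reflect the paper's observation that model dialogues can keep exploring without converging; a real harness would prompt MLLMs in place of the rule-based `respond` functions.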
