Communicating about Space: Language-Mediated Spatial Integration Across Partial Views

arXiv cs.CV / 3/31/2026


Key Points

  • The paper studies whether multimodal LLM agents can collaborate through dialogue to integrate partial, egocentric observations into a coherent allocentric (shared) spatial understanding.
  • It introduces COSMIC, a benchmark with 899 indoor 3D scenes and 1250 QA pairs across five tasks where two static MLLM agents exchange natural-language messages to answer spatial queries.
  • Results show a capability hierarchy: models are strongest at grounding shared anchor objects across views but weaker at relational reasoning, and they largely fail to construct globally consistent maps (near chance even for frontier systems).
  • Adding “thinking” capability improves anchor grounding reliability, but does not meaningfully enable higher-level spatial communication or global consistency.
  • A comparison with 250 human-human dialogues finds humans reach far higher accuracy (95% vs. 72% for the best model, Gemini-3-Pro-Thinking) and converge on a shared mental model, while model dialogues tend to keep exploring rather than converging; code/data are released on GitHub.

Abstract

Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a consistent capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance even for frontier models. Moreover, we find that thinking capability yields consistent gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we additionally collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, leaving significant room for improvement even for the best-performing model, Gemini-3-Pro-Thinking, which achieves 72% aggregate accuracy. Moreover, human conversations become increasingly specific as partners converge on a shared mental model, whereas model dialogues continue to explore new possibilities rather than converging, consistent with a limited ability to build and maintain a robust shared mental model. Our code and data are available at https://github.com/ankursikarwar/Cosmic
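The two-agent setup described above — static agents with different egocentric views exchanging natural-language messages before answering a spatial query — can be sketched as a simple turn-taking loop. This is a minimal illustrative sketch, not the paper's actual implementation: the `Agent` class, `answer_query` function, and `MAX_TURNS` budget are all assumed names, and the stub `respond` method stands in for a real MLLM call conditioned on an image and the dialogue history.

```python
# Hypothetical sketch of COSMIC-style collaborative spatial communication.
# All names here (Agent, answer_query, MAX_TURNS) are illustrative
# assumptions, not the benchmark's real API.

MAX_TURNS = 4  # assumed per-query dialogue budget


class Agent:
    """Stand-in for a static MLLM agent holding one egocentric view."""

    def __init__(self, name, view_description):
        self.name = name
        self.view = view_description
        self.transcript = []  # dialogue history this agent has seen

    def respond(self, incoming):
        # A real agent would prompt an MLLM with its view image plus the
        # dialogue so far; here we just emit a view-grounded message.
        if incoming:
            self.transcript.append(incoming)
        message = f"{self.name} sees: {self.view}"
        self.transcript.append(message)
        return message


def answer_query(agent_a, agent_b, query):
    """Alternate messages between the two agents, then gather evidence."""
    message = query
    for _ in range(MAX_TURNS):
        message = agent_a.respond(message)
        message = agent_b.respond(message)
    # A real system would have one agent commit to a final answer after
    # the exchange; this sketch just returns both agents' evidence.
    return agent_a.view, agent_b.view


a = Agent("A", "a sofa left of a lamp")
b = Agent("B", "the same lamp right of a window")
evidence = answer_query(a, b, "Is the sofa near the window?")
```

The point of the loop is that neither agent ever sees the other's view directly: integration into an allocentric answer has to happen entirely through the exchanged messages, which is exactly the ability the benchmark probes.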