SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

arXiv cs.CV · April 27, 2026


Key Points

  • The paper introduces SpaMEM, a new large-scale diagnostic benchmark for measuring how multimodal LLM/VLM systems maintain spatial coherence over long horizons in embodied environments where beliefs must be revised under change.
  • SpaMEM is based on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), derived from 25,000+ action sequences in 1,000 procedurally generated houses.
  • The benchmark defines a three-level hierarchy of embodied spatial reasoning with 15 tasks, ranging from atomic perception (single observations) to temporal reasoning using oracle textual histories, and finally to end-to-end belief maintenance from raw visual streams.
  • Experiments on representative open-source VLM families show a consistent “stacked bottleneck” in coordinate-consistent grounding and a sharp performance collapse from Level 2 to Level 3, suggesting strong reliance on symbolic/text-based bookkeeping rather than robust visual episodic memory.
  • The authors argue that SpaMEM enables fine-grained diagnosis of failure modes and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
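The action-conditioned scene transformations the benchmark is built around (spawn, place, remove) can be pictured as updates to a symbolic belief state. The following is an illustrative sketch, not the authors' code: the `SceneBelief` class, its fields, and the action/argument format are all assumptions chosen to show what coordinate-consistent belief revision entails.

```python
from dataclasses import dataclass, field

@dataclass
class SceneBelief:
    # Hypothetical belief state: object id -> (room, (x, y)) position.
    # The action vocabulary (spawn/place/remove) follows the paper;
    # everything else here is illustrative.
    objects: dict = field(default_factory=dict)

    def apply(self, action: str, obj: str, room: str = None, pos: tuple = None):
        if action == "spawn":       # a new object appears in the scene
            self.objects[obj] = (room, pos)
        elif action == "place":     # an existing object is relocated
            self.objects[obj] = (room, pos)
        elif action == "remove":    # the object disappears; belief must be retracted
            self.objects.pop(obj, None)
        else:
            raise ValueError(f"unknown action: {action}")

belief = SceneBelief()
belief.apply("spawn", "mug", room="kitchen", pos=(1.2, 0.4))
belief.apply("place", "mug", room="livingroom", pos=(3.0, 1.1))
belief.apply("remove", "mug")
print("mug" in belief.objects)  # False: the stale belief was revised away
```

A model with text access to such a state log (Level 2) only needs this kind of bookkeeping; Level 3 demands that the same updates be inferred from raw egocentric video, which is where the reported collapse occurs.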

Abstract

Multimodal large language models (MLLMs) have advanced static visual–spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
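The Level-2 setting, where models receive oracle textual state histories instead of raw pixels, can be made concrete with a small sketch. This is an assumed serialization format for illustration only; the paper does not specify how its oracle histories are worded, and `render_history` and the log tuples are hypothetical.

```python
# Hypothetical rendering of a ground-truth action log into the kind of
# textual state history a Level-2 model might be prompted with.
def render_history(actions):
    lines = []
    for t, (act, obj, room) in enumerate(actions):
        if act == "spawn":
            lines.append(f"t={t}: a {obj} appears in the {room}")
        elif act == "place":
            lines.append(f"t={t}: the {obj} is moved to the {room}")
        elif act == "remove":
            lines.append(f"t={t}: the {obj} is removed from the {room}")
    return "\n".join(lines)

log = [
    ("spawn", "mug", "kitchen"),
    ("place", "mug", "livingroom"),
    ("remove", "mug", "livingroom"),
]
print(render_history(log))
```

Factoring perception out this way lets the benchmark attribute Level-3 failures to visual memory rather than to temporal reasoning over the same task dimensions.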