Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

arXiv cs.RO / 3/26/2026

Key Points

  • The paper argues that long-horizon robotic manipulation becomes non-Markovian at decision time when occlusion and state changes cause perceptual aliasing, requiring memory for reliable action selection.
  • It introduces Chameleon, a human-inspired episodic memory approach that writes geometry-grounded multimodal tokens and uses a differentiable memory stack for goal-directed recall.
  • The method aims to retain disambiguating fine-grained cues that similarity-based memory retrieval often discards, reducing retrieval of decision-irrelevant but perceptually similar episodes.
  • The authors release Camo-Dataset, a real-robot UR5e dataset covering episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing.
  • Experiments report consistent improvements in decision reliability and long-horizon control over strong baselines in perceptually confusable settings.

Abstract

Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.
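
To make the notion of perceptual aliasing concrete, here is a minimal toy sketch in Python. It is not Chameleon or any part of the paper's method: the cup-and-ball scenario, the observation strings, and the names EPISODES, memoryless_policy, history_policy, and success_rate are all hypothetical, chosen only for illustration. Two interaction histories end in the same decision-time observation, so a policy that sees only that observation cannot be right in both, while a policy with access to the history can.

```python
# Toy illustration of perceptual aliasing (hypothetical scenario, not the paper's method).
# Each episode is a list of observations; the last one is what the agent sees at decision time.
EPISODES = {
    "ball_left": ["ball_over_left_cup", "cups_closed"],
    "ball_right": ["ball_over_right_cup", "cups_closed"],
}
CORRECT_ACTION = {"ball_left": "lift_left", "ball_right": "lift_right"}


def memoryless_policy(observation):
    """Acts on the current observation only, so it must commit to one action per observation."""
    return "lift_left"


def history_policy(history):
    """Scans the full interaction history for the cue that disambiguates the final observation."""
    for obs in history:
        if obs == "ball_over_left_cup":
            return "lift_left"
        if obs == "ball_over_right_cup":
            return "lift_right"
    return "lift_left"  # arbitrary fallback when no cue was observed


def success_rate(policy, use_history):
    """Fraction of episodes in which the policy picks the correct action."""
    hits = 0
    for name, observations in EPISODES.items():
        policy_input = observations if use_history else observations[-1]
        hits += policy(policy_input) == CORRECT_ACTION[name]
    return hits / len(EPISODES)


if __name__ == "__main__":
    print("memoryless:", success_rate(memoryless_policy, use_history=False))
    print("with history:", success_rate(history_policy, use_history=True))
```

Whichever fixed action the memoryless policy chooses for the aliased observation, it is wrong in one of the two episodes. The history-based policy succeeds only because the placement step, the fine-grained cue, was kept rather than compressed away.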