ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation

arXiv cs.RO / 3/31/2026


Key Points

  • The paper introduces ReMemNav, a hierarchical, memory-augmented framework for zero-shot object navigation that targets failures of current vision-language models such as spatial hallucinations, local exploration deadlocks, and semantic-to-control disconnects.
  • ReMemNav anchors VLM spatial reasoning using a “Recognize Anything Model” and adds an adaptive dual-modal rethinking mechanism driven by an episodic semantic buffer to verify target visibility and correct decisions from historical memory.
  • For low-level control, it computes feasible action sequences using depth masks so the VLM can choose an action mapped to concrete spatial movement.
  • Experiments on HM3D and MP3D show ReMemNav improves both success rate (SR) and path efficiency (SPL) over training-free zero-shot baselines, with reported absolute gains varying by dataset split.
  • Overall, the work demonstrates that combining panoramic semantic priors, episodic memory, and depth-guided action feasibility can substantially improve zero-shot navigation performance without task-specific training.
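The episodic semantic buffer and its two rethinking triggers can be sketched as a small data structure. This is a hypothetical illustration, not the paper's implementation: the real buffer presumably stores richer panoramic tags from the Recognize Anything Model, and the class name, entry format, and deadlock heuristic below are all assumptions.

```python
from collections import deque

class EpisodicSemanticBuffer:
    """Fixed-size queue of recent semantic observations (sketch).

    Each entry is (2-D position, set of detected object labels).
    """

    def __init__(self, maxlen=8):
        self.steps = deque(maxlen=maxlen)

    def push(self, position, labels):
        self.steps.append((position, set(labels)))

    def target_recently_seen(self, target):
        # Rethinking trigger 1: the target label appeared in memory,
        # so the agent should re-verify visibility before moving on.
        return any(target in labels for _, labels in self.steps)

    def is_deadlocked(self, radius=1.0):
        # Rethinking trigger 2: all recent positions cluster within a
        # small radius, suggesting a local exploration deadlock.
        if len(self.steps) < self.steps.maxlen:
            return False
        pts = [p for p, _ in self.steps]
        cx = sum(x for x, _ in pts) / len(pts)
        cy = sum(y for _, y in pts) / len(pts)
        return all((x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
                   for x, y in pts)
```

When either trigger fires, a correction step would consult this memory (e.g. re-prompting the VLM with the stored observations) rather than continuing the default exploration policy.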

Abstract

Zero-shot object navigation requires agents to locate unseen target objects in unfamiliar environments without prior maps or task-specific training, which remains a significant challenge. Although recent advancements in vision-language models (VLMs) provide promising commonsense reasoning capabilities for this task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. To address these issues, we propose a novel hierarchical navigation framework named ReMemNav, which seamlessly integrates panoramic semantic priors and episodic memory with VLMs. We introduce the Recognize Anything Model to anchor the spatial reasoning process of the VLM. We also design an adaptive dual-modal rethinking mechanism based on an episodic semantic buffer queue. The proposed mechanism actively verifies target visibility and corrects decisions using historical memory to prevent deadlocks. For low-level action execution, ReMemNav extracts a sequence of feasible actions using depth masks, allowing the VLM to select the optimal action for mapping into actual spatial movement. Extensive evaluations on HM3D and MP3D demonstrate that ReMemNav outperforms existing training-free zero-shot baselines in both success rate and exploration efficiency. Specifically, we achieve significant absolute performance improvements, with SR and SPL increasing by 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.
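The depth-mask action-filtering idea in the abstract can be illustrated with a minimal sketch: for each candidate heading, inspect the vertical strip of the depth image covering that direction and keep the heading only if the minimum depth clears a threshold. The function name, candidate headings, pinhole-camera assumption, and thresholds are all assumptions, not the paper's actual procedure; the surviving headings would then be offered to the VLM as concrete movement options.

```python
import numpy as np

def feasible_actions(depth, headings_deg=(-60, -30, 0, 30, 60),
                     hfov_deg=90.0, clearance_m=0.5, strip_frac=0.15):
    """Return candidate headings whose depth strip is obstacle-free.

    depth: (H, W) array of metric depths.
    A heading is feasible when every pixel in its vertical image strip
    is farther than clearance_m, assuming a pinhole camera with the
    given horizontal field of view.
    """
    h, w = depth.shape
    feasible = []
    for heading in headings_deg:
        # Map heading angle to a horizontal pixel fraction.
        frac = 0.5 + heading / hfov_deg
        if not 0.0 <= frac <= 1.0:
            continue  # heading falls outside the camera frustum
        center = int(frac * (w - 1))
        half = max(1, int(strip_frac * w / 2))
        strip = depth[:, max(0, center - half):min(w, center + half + 1)]
        if strip.min() > clearance_m:
            feasible.append(heading)
    return feasible
```

For example, with an obstacle filling the left third of the frame, the left-of-center heading is rejected while straight-ahead and right-of-center headings survive.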