Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents

arXiv cs.CV / 4/22/2026

Key Points

  • The paper argues that structured spatial memory is crucial for long-horizon embodied navigation, and criticizes existing two-stage, offline memory reconstruction approaches for being overly geometry-focused and missing semantic landmarks.
  • It proposes ABot-Explorer, an online, RGB-only active exploration framework that unifies exploration and memory construction using Large Vision-Language Models (VLMs) to extract Semantic Navigational Affordances (SNA).
  • ABot-Explorer integrates SNAs into a hierarchical SG-Memo as it explores and prioritizes structural transit nodes, mirroring human exploration logic to cover the environment efficiently.
  • The authors release a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations, enabling research on semantic-aligned memory construction.
  • Experiments show ABot-Explorer achieves significantly better exploration efficiency and environment coverage than prior state-of-the-art methods, and its SG-Memo supports multiple downstream tasks effectively.

Abstract

Constructing structured spatial memory is essential for enabling long-horizon reasoning in complex embodied navigation tasks. Current memory construction predominantly relies on a decoupled, two-stage paradigm: agents first aggregate environmental data through exploration, followed by the offline reconstruction of spatial memory. However, this post-hoc and geometry-centric approach precludes agents from leveraging high-level semantic intelligence, often causing them to overlook navigationally critical landmarks (e.g., doorways and staircases) that serve as fundamental semantic anchors in human cognitive maps. To bridge this gap, we propose ABot-Explorer, a novel active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer leverages Large Vision-Language Models (VLMs) to distill Semantic Navigational Affordances (SNA), which act as cognitive-aligned anchors to guide the agent's movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mirrors human-like exploratory logic by prioritizing structural transit nodes to facilitate efficient coverage. To support this framework, we contribute a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations. Experimental results demonstrate that ABot-Explorer significantly outperforms current state-of-the-art methods in both exploration efficiency and environment coverage, while the resulting SG-Memo is shown to effectively support diverse downstream tasks.
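
To make the idea of prioritizing structural transit nodes concrete, here is a minimal sketch of a two-level memory that is updated online and always proposes an unvisited transit node before any other target. It is an illustration under stated assumptions, not the authors' implementation: the class names `Affordance` and `SGMemo`, the `TRANSIT_KINDS` label set, and the "transit first, then insertion order" rule are all hypothetical, and the VLM step that would produce the affordances is omitted entirely.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Hypothetical label set. The abstract names doorways and staircases as
# navigationally critical landmarks; treating them as "transit" is an assumption.
TRANSIT_KINDS = {"doorway", "staircase"}


@dataclass
class Affordance:
    """One semantic navigational affordance observed by the agent (hypothetical schema)."""
    node_id: str
    kind: str            # e.g. "doorway", "staircase", "open_space"
    region: str          # coarse region (room, corridor) where it was seen
    visited: bool = False


@dataclass
class SGMemo:
    """Toy two-level memory: regions on top, affordance nodes underneath."""
    regions: dict[str, list[str]] = field(default_factory=dict)
    nodes: dict[str, Affordance] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add(self, aff: Affordance, linked_to: Optional[str] = None) -> None:
        """Insert a newly observed affordance; re-observations are ignored."""
        if aff.node_id in self.nodes:
            return
        self.nodes[aff.node_id] = aff
        self.regions.setdefault(aff.region, []).append(aff.node_id)
        if linked_to is not None:
            self.edges.add((linked_to, aff.node_id))

    def next_target(self) -> Optional[Affordance]:
        """Return an unvisited transit node if one exists, else any unvisited node."""
        unvisited = [n for n in self.nodes.values() if not n.visited]
        if not unvisited:
            return None
        # False sorts before True, so transit nodes win; min() keeps the
        # earliest-inserted node among ties.
        return min(unvisited, key=lambda n: n.kind not in TRANSIT_KINDS)


memo = SGMemo()
memo.add(Affordance("sofa_area", "open_space", region="living_room"))
memo.add(Affordance("door_1", "doorway", region="living_room"), linked_to="sofa_area")
memo.add(Affordance("stairs_1", "staircase", region="hallway"), linked_to="door_1")

order = []
while (target := memo.next_target()) is not None:
    order.append(target.node_id)
    target.visited = True   # stand-in for actually navigating to the node
print(order)  # ['door_1', 'stairs_1', 'sofa_area']
```

In this toy version the doorway and staircase are visited before the open area even though the open area was observed first, which is the ordering behavior the abstract attributes to human-like exploration. A real system would also need distance, reachability, and uncertainty in the ranking, none of which is modeled here.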