INHerit-SG: Incremental Hierarchical Semantic Scene Graphs with RAG-Style Retrieval

arXiv cs.RO / 4/28/2026


Key Points

  • INHerit-SG is a new research framework for building hierarchical semantic scene graphs for robot navigation by structuring 3D environments into a RAG-ready knowledge base.
  • It uses an asynchronous dual-stream architecture with comprehensive node representations and event-triggered updates, while decoupling geometric segmentation from semantic reasoning to improve mapping efficiency.
  • Semantic nodes store natural-language summaries to enable text-based retrieval, and the approach includes an interpretable pipeline that combines multi-role LLM reasoning with the scene graph’s topology.
  • The system adds a visual verification step to reduce false positives during retrieval.
  • The method is evaluated on the newly built HM3DSem-SQR benchmark and in real-world settings, achieving state-of-the-art results for complex embodied queries, particularly those with negations and chained spatial constraints.
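
To make the dual-stream idea in the points above concrete, here is a minimal sketch of a hierarchical node store in which a cheap geometric update only enqueues a node for semantic refresh when it has changed enough. Everything named here (`Node`, `SceneGraph`, `refresh_ratio`, the point-count trigger, the `describe` callback) is an assumption made for illustration; the summary does not say how INHerit-SG represents nodes or which events trigger an update.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    node_id: str
    level: str                      # hypothetical levels, e.g. "room" or "object"
    summary: str = ""               # natural-language summary used for text retrieval
    parent: str | None = None
    children: list[str] = field(default_factory=list)
    point_count: int = 0            # stand-in for the node's geometric state
    summarized_at_count: int = 0    # point count when the summary was last refreshed


class SceneGraph:
    def __init__(self, refresh_ratio: float = 0.5):
        self.nodes: dict[str, Node] = {}
        self.pending: deque[str] = deque()  # work queue read by the semantic stream
        self.refresh_ratio = refresh_ratio  # assumed trigger: relative geometric growth

    def add_node(self, node_id: str, level: str, parent: str | None = None) -> Node:
        node = Node(node_id, level, parent=parent)
        self.nodes[node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node_id)
        self.pending.append(node_id)  # a new node is itself an event
        return node

    def add_points(self, node_id: str, n: int) -> None:
        """Geometric stream: cheap update that enqueues only on a large change."""
        node = self.nodes[node_id]
        node.point_count += n
        grown = node.point_count - node.summarized_at_count
        threshold = self.refresh_ratio * max(node.summarized_at_count, 1)
        if grown > threshold and node_id not in self.pending:
            self.pending.append(node_id)

    def run_semantic_stream(self, describe: Callable[[Node], str]) -> list[str]:
        """Semantic stream: drain the queue and refresh summaries."""
        refreshed = []
        while self.pending:
            node = self.nodes[self.pending.popleft()]
            node.summary = describe(node)  # in practice a slow model call
            node.summarized_at_count = node.point_count
            refreshed.append(node.node_id)
        return refreshed
```

In this sketch the two streams run in one thread for simplicity. An asynchronous version would call `run_semantic_stream` from a separate worker so that the slow `describe` step never blocks `add_points`.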

Abstract

Driven by recent advancements in foundation models, semantic scene graphs have emerged as a promising paradigm for high-level 3D environmental abstraction in robot navigation. However, existing frameworks struggle to handle complex embodied queries while ensuring continuous semantic graph construction. To address these limitations, we present INHerit-SG, an asynchronous dual-stream architecture that systematically structures the 3D environment into a RAG-ready knowledge base. Specifically, our framework integrates comprehensive node representations, an event-triggered asynchronous update scheme, and a structured retrieval mechanism. Geometric segmentation is decoupled from semantic reasoning to maintain mapping efficiency, and the semantic nodes store natural language summaries to support text-based retrieval. Furthermore, we propose an interpretable retrieval pipeline that couples the reasoning capabilities of multi-role LLMs with the topological structure of the scene graph, followed by a visual verification process to mitigate false positives. We evaluate INHerit-SG on a newly constructed benchmark for complex embodied semantic query retrieval, HM3DSem-SQR, and in real-world environments. Experiments demonstrate that our system achieves state-of-the-art performance on complex queries, especially those involving negations and chained spatial constraints. Project Page: https://fangyuktung.github.io/INHeritSG.github.io/
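
As a rough illustration of what a query with a negation and a chained spatial constraint involves, the sketch below filters node summaries for a target, drops excluded terms, checks the chain by walking parent links in the graph, and only then runs a verification callback. This is not the paper's pipeline: the multi-role LLM reasoning is replaced by a hand-written `StructuredQuery` and plain substring matching, and `GraphNode`, `ancestors_match`, `retrieve`, and `verify` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GraphNode:
    node_id: str
    summary: str                 # natural-language summary stored on the node
    parent: str | None = None


@dataclass
class StructuredQuery:
    target: str                                           # e.g. "mug"
    must_not: list[str] = field(default_factory=list)     # negated terms
    ancestors: list[str] = field(default_factory=list)    # chain, nearest first


def _parent_of(node: GraphNode, nodes: dict[str, GraphNode]) -> GraphNode | None:
    return nodes.get(node.parent) if node.parent else None


def ancestors_match(node: GraphNode, nodes: dict[str, GraphNode],
                    wanted: list[str]) -> bool:
    """Walk parent links upward; each wanted term must match, in order."""
    current = _parent_of(node, nodes)
    for term in wanted:
        while current is not None and term.lower() not in current.summary.lower():
            current = _parent_of(current, nodes)
        if current is None:
            return False
        current = _parent_of(current, nodes)  # next term must match higher up
    return True


def retrieve(query: StructuredQuery, nodes: dict[str, GraphNode],
             verify: Callable[[GraphNode], bool]) -> list[str]:
    hits = []
    for node in nodes.values():
        text = node.summary.lower()
        if query.target.lower() not in text:
            continue                                   # not the requested object
        if any(term.lower() in text for term in query.must_not):
            continue                                   # negation: drop excluded terms
        if not ancestors_match(node, nodes, query.ancestors):
            continue                                   # chained spatial constraint
        if verify(node):                               # stand-in for visual verification
            hits.append(node.node_id)
    return hits


nodes = {n.node_id: n for n in [
    GraphNode("kitchen", "kitchen with a dining area"),
    GraphNode("table_1", "wooden table", parent="kitchen"),
    GraphNode("mug_red", "red ceramic mug", parent="table_1"),
    GraphNode("mug_blue", "blue ceramic mug", parent="table_1"),
    GraphNode("bedroom", "bedroom with a single bed"),
    GraphNode("desk_1", "desk by the window", parent="bedroom"),
    GraphNode("mug_white", "white mug", parent="desk_1"),
]}

# "the mug on the table in the kitchen that is not red"
query = StructuredQuery(target="mug", must_not=["red"], ancestors=["table", "kitchen"])
result = retrieve(query, nodes, verify=lambda node: True)
```

Substring matching is brittle (a negated "red" would also exclude a "covered" mug), which is one reason a real system would rely on model-based reasoning for this step. The final `verify` call is where a check against the observed image could reject a candidate that only matched on text.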