Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

arXiv cs.AI / 3/27/2026


Key Points

  • The paper addresses how embodied agents can efficiently generate semantic scene graphs (SSGs) by navigating under limited action budgets, balancing information gain against navigation cost.
  • It introduces a modular navigation component for Embodied SSG generation and modernises its decision-making via a revised discrete action formulation and a comparison of policy architectures (a single-head policy over atomic actions versus a factorised multi-head policy over action components).
  • Experiments study compact vs finer-grained motion sets, evaluate curriculum learning, and optionally add depth-based collision supervision to improve safety.
  • Results indicate that swapping the optimization algorithm alone boosts SSG completeness by 21% versus the baseline under the same reward shaping, while depth supervision mainly improves execution safety rather than completeness.
  • The best performance comes from combining modern optimization with a finer-grained, factorised action representation, achieving the strongest completeness–efficiency trade-off.
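The atomic-versus-factorised distinction above can be made concrete with a toy sketch. Assuming, purely for illustration, a hypothetical motion set of turn and move components (the paper's actual action sets are not specified here), a single-head policy outputs one categorical over all turn–move combinations, while a factorised policy outputs one head per component and forms the joint as a product, which shrinks the output layer:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical discrete motion components (illustrative, not from the paper):
TURNS = ["left_30", "left_15", "none", "right_15", "right_30"]   # 5 options
MOVES = ["stop", "forward_0.25m", "forward_0.5m"]                # 3 options

rng = np.random.default_rng(0)
feat_dim = 16
x = rng.standard_normal(feat_dim)  # stand-in for the agent's observation embedding

# Single-head atomic policy: one categorical over all 5 * 3 = 15 combos.
W_atomic = rng.standard_normal((len(TURNS) * len(MOVES), feat_dim))
p_atomic = softmax(W_atomic @ x)

# Factorised multi-head policy: one head per component (5 + 3 = 8 logits).
W_turn = rng.standard_normal((len(TURNS), feat_dim))
W_move = rng.standard_normal((len(MOVES), feat_dim))
p_turn, p_move = softmax(W_turn @ x), softmax(W_move @ x)
p_joint = np.outer(p_turn, p_move)  # joint (turn, move) probability, heads independent

print(p_atomic.shape, p_joint.shape)              # same 15-way joint support
print(W_atomic.size, W_turn.size + W_move.size)   # 240 vs 128 parameters
```

The factorised heads cover the same joint action space with fewer output parameters, at the cost of assuming the components are conditionally independent given the observation; this trade-off is one plausible reason finer-grained action sets pair well with factorisation.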

Abstract

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness–efficiency trade-off.
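The depth-based collision supervision the abstract mentions can be illustrated with a minimal sketch. Assuming, hypothetically, that the agent receives a per-frame depth image, one simple safety signal is to veto forward motion when the closest reading in a central image band falls below a clearance threshold (the function name, band geometry, and threshold here are illustrative assumptions, not the paper's method):

```python
import numpy as np

def forward_is_safe(depth, min_clearance=0.5, band=0.2):
    """Hypothetical collision check: veto 'forward' if the closest depth
    reading (in metres) inside a central image band is under min_clearance."""
    h, w = depth.shape
    r0, r1 = int(h * (0.5 - band)), int(h * (0.5 + band))
    c0, c1 = int(w * (0.5 - band)), int(w * (0.5 + band))
    return float(depth[r0:r1, c0:c1].min()) >= min_clearance

depth = np.full((64, 64), 3.0)   # 3 m of free space everywhere
print(forward_is_safe(depth))    # safe to move forward

depth[30:34, 30:34] = 0.3        # obstacle 0.3 m ahead in the centre band
print(forward_is_safe(depth))    # forward action would be vetoed
```

A signal like this only gates execution; it does not change what the agent observes, which is consistent with the reported finding that depth supervision improves collision-free motion while leaving SSG completeness largely unchanged.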