Learning-Based Hierarchical Scene Graph Matching for Robot Localization Leveraging Prior Maps

arXiv cs.RO / 5/1/2026


Key Points

  • The paper proposes a learning-based, end-to-end differentiable pipeline for matching hierarchical scene graphs to improve indoor robot localization and correct SLAM drift using prior maps (e.g., BIM-derived representations).
  • It tackles the scalability of correspondence matching by replacing expensive combinatorial node-to-node search with a learned matcher that exploits the multi-level semantic hierarchy, rather than treating the graphs as flat.
  • The method augments both the online sensor-derived and offline prior graphs with semantically motivated edge types that represent intra- and inter-level relationships, enabling simultaneous alignment from high-level room concepts to low-level wall surfaces.
  • Trained solely on floor plans, the approach achieves better F1 than a combinatorial baseline on real LiDAR environments while running about an order of magnitude faster, indicating practical zero-shot generalization to BIM-assisted localization.
  • Overall, it demonstrates that hierarchical, semantics-aware graph matching can reliably connect robot observations to known structural priors for more robust localization.
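To make the edge-augmentation idea from the key points concrete, here is a minimal sketch (not the paper's implementation; all names are illustrative) of a two-level scene graph whose edges are labeled as intra-level (room-to-room, wall-to-wall) or inter-level (room-to-wall):

```python
# Illustrative two-level scene graph with typed edges, as described in the
# summary: intra-level edges relate entities on the same semantic level,
# inter-level edges connect high-level rooms to low-level wall surfaces.
INTRA = "intra_level"   # e.g. room adjacent to room, wall meets wall
INTER = "inter_level"   # e.g. room is bounded by a wall surface

def build_scene_graph():
    nodes = {
        "room_a": {"level": "room"},
        "room_b": {"level": "room"},
        "wall_1": {"level": "wall"},
        "wall_2": {"level": "wall"},
    }
    edges = [
        ("room_a", "room_b", INTRA),  # rooms share a boundary
        ("wall_1", "wall_2", INTRA),  # walls meet at a corner
        ("room_a", "wall_1", INTER),  # room_a is bounded by wall_1
        ("room_b", "wall_2", INTER),  # room_b is bounded by wall_2
    ]
    return nodes, edges

def edges_of_type(edges, edge_type):
    """Filter the edge list by semantic edge type."""
    return [(u, v) for u, v, t in edges if t == edge_type]
```

Both the online sensor-derived graph and the offline BIM/floor-plan graph would carry the same typed-edge structure, which is what lets a matcher align rooms and wall surfaces simultaneously.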

Abstract

Accurate localization is a fundamental requirement for autonomous robots operating in indoor environments. Scene graphs encode the spatial structure of an environment as a hierarchy of semantic entities and their relationships, and can be constructed both online from robot sensor data and offline from architectural priors such as Building Information Models (BIM). Matching these two complementary representations enables drift correction in SLAM by grounding robot observations against a known structural prior. However, establishing reliable node-to-node correspondences between them remains an open challenge: existing combinatorial methods are prohibitively expensive at scale, and prior learned approaches address only flat graph matching, ignoring the multi-level semantic structure present in both representations. Here we present a learned, end-to-end differentiable pipeline that augments both graphs with semantically motivated edge types encoding intra- and inter-level relationships, explicitly exploiting this hierarchy to enable simultaneous matching from high-level room concepts down to low-level wall surfaces. Trained exclusively on floor plans, the proposed method outperforms the combinatorial baseline in F1 on real LiDAR environments while running an order of magnitude faster, demonstrating viable zero-shot generalization for BIM-assisted robot localization.
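The abstract states that the matching pipeline is end-to-end differentiable but does not detail its internals. One common ingredient for differentiable node correspondence (an assumption here, not a detail from the paper) is Sinkhorn normalization, which turns a matrix of node-similarity scores into a doubly-stochastic soft assignment that gradients can flow through:

```python
# Hedged sketch: Sinkhorn normalization as a generic route to differentiable
# graph matching. This is NOT the paper's stated method, just a standard
# building block for soft node-to-node assignment.
import math

def sinkhorn(scores, n_iters=50):
    """Alternately normalize rows and columns of exp(scores) so the result
    approaches a doubly-stochastic soft assignment matrix (square input)."""
    m = [[math.exp(s) for s in row] for row in scores]
    n = len(m)
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Normalize each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m
```

In a learned matcher, the score matrix would come from node embeddings of the two scene graphs, and the soft assignment would be supervised against ground-truth correspondences; every step above is smooth, which is what makes such a pipeline trainable end to end.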