Contextual Graph Representations for Task-Driven 3D Perception and Planning

arXiv cs.AI / 3/31/2026

Key Points

  • The paper argues that 3D scene graphs (hierarchical, dense object-relation representations extracted from visual-inertial data) can improve robot task planning, but they typically contain far more objects and relations than any single task needs, making the full graphs impractically large for planning (see the sketch after this list).
  • It evaluates whether existing embodied AI environments are suitable for research combining robot task planning and 3D scene graphs, and introduces a benchmark to compare state-of-the-art classical planners.
  • The thesis studies graph neural network approaches for learning contextual graph representations that capture relevant relational invariances, aiming to reduce state-space complexity and enable faster planning.
  • Overall, it positions contextual graph representations as a path toward making scene-graph-based planning more deployable in resource-constrained robotic settings.

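To make the size argument concrete, here is a minimal, hypothetical sketch of a hierarchical scene graph and of pruning it down to a task-relevant subgraph. The layer names, node fields, and the `task_relevant_subgraph` helper are illustrative assumptions, not the thesis's data structures or method.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A scene-graph node: a building, room, or object (illustrative layers)."""
    node_id: str
    layer: str                      # e.g. "building", "room", "object"
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneGraph:
    """A toy multiplex scene graph: nodes plus typed relations (edges)."""
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src, relation, dst)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def task_relevant_subgraph(self, goal_objects, hops=1):
        """Keep only nodes within `hops` relation steps of the goal objects.

        Mimics the observation that a task touches a small subset of the
        full graph; everything else only inflates the planner's state
        space. (Hypothetical pruning rule, not the thesis's approach.)
        """
        keep = set(goal_objects)
        for _ in range(hops):
            frontier = {
                n for (s, _, d) in self.edges
                for n in (s, d)
                if s in keep or d in keep
            }
            keep |= frontier
        sub = SceneGraph()
        sub.nodes = {i: n for i, n in self.nodes.items() if i in keep}
        sub.edges = [e for e in self.edges if e[0] in keep and e[2] in keep]
        return sub

# Build a tiny scene and prune it for a "fetch the mug" task.
g = SceneGraph()
for nid, layer in [("house", "building"), ("kitchen", "room"),
                   ("bedroom", "room"), ("mug", "object"),
                   ("table", "object"), ("bed", "object")]:
    g.add_node(Node(nid, layer))
g.add_edge("house", "contains", "kitchen")
g.add_edge("house", "contains", "bedroom")
g.add_edge("kitchen", "contains", "table")
g.add_edge("table", "supports", "mug")
g.add_edge("bedroom", "contains", "bed")

sub = g.task_relevant_subgraph({"mug"}, hops=2)
print(sorted(sub.nodes))   # the bedroom/bed branch is pruned away
```
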
Abstract

Recent advances in computer vision facilitate fully automatic extraction of object-centric relational representations from visual-inertial data. These state representations, dubbed 3D scene graphs, are a hierarchical decomposition of real-world scenes with a dense multiplex graph structure. While 3D scene graphs promise efficient task planning for robot systems, they contain far more objects and relations than any given task requires. This inflates the state space that task planners must search over and prohibits deployment in resource-constrained settings. This thesis tests the suitability of existing embodied AI environments for research at the intersection of robot task planning and 3D scene graphs, and constructs a benchmark for the empirical comparison of state-of-the-art classical planners. Furthermore, we explore the use of graph neural networks to harness invariances in the relational structure of planning domains and learn representations that afford faster planning.
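
The abstract's final point, using a graph neural network to exploit relational structure, can be illustrated with a single round of message passing. Below is a minimal numpy sketch assuming a generic mean-aggregation layer; the weight shapes, `gnn_layer` name, and update rule are common GNN conventions, not the architecture proposed in the thesis.

```python
import numpy as np

def gnn_layer(node_feats, edges, w_self, w_neigh):
    """One mean-aggregation message-passing step.

    node_feats:      (num_nodes, d_in) array of node features
    edges:           list of (src, dst) index pairs, treated as undirected
    w_self, w_neigh: (d_in, d_out) weight matrices (learned in practice)
    """
    n, _ = node_feats.shape
    agg = np.zeros_like(node_feats)
    deg = np.zeros(n)
    for s, d in edges:
        agg[d] += node_feats[s]
        agg[s] += node_feats[d]
        deg[s] += 1
        deg[d] += 1
    agg /= np.maximum(deg, 1)[:, None]      # mean over neighbors
    # Combine each node's own features with its neighborhood summary.
    return np.maximum(node_feats @ w_self + agg @ w_neigh, 0.0)  # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 nodes, 8-dim features
edges = [(0, 1), (1, 2), (2, 3)]             # a small chain of relations
h = gnn_layer(x, edges, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
print(h.shape)                               # (4, 16) contextual embeddings
```

Stacking a few such layers yields node embeddings that summarize each object's relational context, which is the kind of representation the thesis explores for cutting down the state space handed to the planner.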