Dynamic Graph Neural Network with Adaptive Features Selection for RGB-D Based Indoor Scene Recognition

arXiv cs.CV / 4/2/2026


Key Points

  • The paper proposes a dynamic graph neural network for RGB-D indoor scene recognition that adaptively selects informative nodes from both color (RGB) and depth modalities.
  • It builds a dynamic graph to model relations among objects/scenes and groups nodes into three levels to capture near-to-far relational structure.
  • The graph is updated dynamically using attention weights, allowing feature propagation and optimization to reflect which nodes/relations matter most.
  • Finally, it fuses the updated RGB and depth features for recognition, reporting improved performance over prior state-of-the-art methods on SUN RGB-D and NYU Depth v2.
  • The work targets the previously open challenge of adaptively exploiting crucial local features from multi-modal RGB-D inputs via graph modeling.
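The pipeline in the key points above — score local features per modality, keep the most informative ones as graph nodes, propagate features over an attention-weighted dynamic graph, then fuse RGB and depth — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`select_nodes`, `attention_update`) are hypothetical, the node scores here are a simple norm proxy for what the paper learns end-to-end, and dimensions are arbitrary.

```python
import numpy as np

np.random.seed(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_nodes(features, k):
    # Adaptive node selection (sketch): rank candidate local features by a
    # score and keep the top-k as graph nodes. The paper learns this score;
    # a feature-norm proxy is used here purely for illustration.
    scores = np.linalg.norm(features, axis=1)
    idx = np.argsort(scores)[::-1][:k]
    return features[idx]

def attention_update(nodes):
    # Dynamic graph update (sketch): pairwise attention weights act as a
    # soft adjacency matrix, and features are propagated through it.
    att = softmax(nodes @ nodes.T / np.sqrt(nodes.shape[1]), axis=1)
    return att @ nodes

# 32 candidate local features per modality, 16-dim each (arbitrary sizes)
rgb = np.random.randn(32, 16)
depth = np.random.randn(32, 16)

rgb_nodes = select_nodes(rgb, k=8)
depth_nodes = select_nodes(depth, k=8)

rgb_out = attention_update(rgb_nodes)
depth_out = attention_update(depth_nodes)

# Late fusion of the updated modality features into one scene descriptor
fused = np.concatenate([rgb_out.mean(axis=0), depth_out.mean(axis=0)])
```

In the actual model the scoring, attention, and fusion are learned jointly; the sketch only shows how the pieces compose.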

Abstract

Multi-modality of color and depth, i.e., RGB-D, is of great importance in recent research on indoor scene recognition. In this data representation, the depth map describes the 3D structure of scenes and the geometric relations among objects. Previous works showed that local features of both modalities are vital for improving recognition accuracy. However, the problem of adaptively selecting and effectively exploiting these key local features remains open in this field. In this paper, a dynamic graph model with an adaptive node selection mechanism is proposed to solve this problem. In this model, a dynamic graph is built to model the relations among objects and the scene, and an adaptive node selection method takes key local features from both the RGB and depth modalities for graph modeling. These nodes are then grouped into three levels, representing near or far relations among objects. Moreover, the graph model is updated dynamically according to attention weights. Finally, the updated and optimized RGB and depth features are fused for indoor scene recognition. Experiments are performed on the public SUN RGB-D and NYU Depth v2 datasets. Extensive results demonstrate that our method outperforms state-of-the-art methods and show that it is able to exploit crucial local features from both the RGB and depth modalities.
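The three-level grouping the abstract mentions — partitioning object relations into near, middle, and far — can be illustrated by bucketing pairwise spatial distances. This is a hedged sketch only: the function name `group_by_distance` and the threshold values are hypothetical, and the paper's grouping is defined over its learned graph rather than raw coordinates.

```python
import numpy as np

def group_by_distance(positions, thresholds=(0.5, 1.5)):
    # Assign each node pair to one of three relation levels by Euclidean
    # distance: 0 = near, 1 = middle, 2 = far. Thresholds are illustrative,
    # not taken from the paper.
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    return np.digitize(d, thresholds)

# Four toy object positions along one axis (meters, hypothetical)
pos = np.array([[0.0, 0, 0], [0.3, 0, 0], [1.0, 0, 0], [3.0, 0, 0]])
levels = group_by_distance(pos)
# e.g. levels[0, 1] is 0 (near), levels[0, 2] is 1 (middle), levels[0, 3] is 2 (far)
```

Per-level relations could then be modeled by separate graph edges or attention heads, which is one common way to encode near-to-far structure.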