A Multimodal Depth-Aware Method For Embodied Reference Understanding

arXiv cs.RO / 4/30/2026


Key Points

  • The paper addresses Embodied Reference Understanding (ERU): identifying a target object in a visual scene from both a language instruction and a pointing gesture, especially when the scene contains multiple plausible candidates.
  • It proposes a new ERU framework that combines LLM-based data augmentation with a depth-map modality to strengthen performance in ambiguous, cluttered environments (see the sketch after this list).
  • A depth-aware decision module is introduced to more effectively fuse linguistic and embodied signals for disambiguation.
  • Experiments on two datasets show the method achieves significantly better and more reliable referent detection than existing baselines.

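The page does not describe the augmentation pipeline in detail; the following is a minimal sketch of what LLM-based augmentation of referring expressions could look like. The function name `augment_expression`, the `call_llm` client, and the prompt wording are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch: paraphrase a referring expression with an LLM
# to augment ERU training data. `call_llm` is a stand-in for any
# text-completion client; the paper's actual prompt and model are unknown.

def augment_expression(call_llm, expression: str, n: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the following referring expression in {n} different ways, "
        "one per line, keeping the same target object:\n"
        f"{expression}"
    )
    reply = call_llm(prompt)
    # Keep each non-empty line as one augmented expression.
    return [line.strip() for line in reply.splitlines() if line.strip()]
```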
Abstract

Embodied Reference Understanding (ERU) requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, a depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
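The abstract does not specify the decision module's architecture. Below is a minimal PyTorch sketch, under the assumption that the module scores each candidate detection by fusing a language embedding, per-candidate RGB and depth features, and a pointing-cue alignment score. Every name here (`DepthAwareDecisionModule`, `point_scores`, the feature dimensions) is an illustrative assumption, not the paper's design.

```python
import torch
import torch.nn as nn

class DepthAwareDecisionModule(nn.Module):
    """Hypothetical sketch: score candidate objects by fusing language,
    RGB, depth, and pointing-gesture cues into one logit per candidate."""

    def __init__(self, lang_dim=512, vis_dim=256, depth_dim=64, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(lang_dim + vis_dim + depth_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one referent logit per candidate
        )

    def forward(self, lang_emb, vis_feats, depth_feats, point_scores):
        # lang_emb:     (D_l,)   sentence embedding of the instruction
        # vis_feats:    (N, D_v) RGB features for N candidate boxes
        # depth_feats:  (N, D_d) pooled depth-map features per candidate
        # point_scores: (N,)     alignment of each box with the pointing ray
        n = vis_feats.size(0)
        lang = lang_emb.unsqueeze(0).expand(n, -1)
        x = torch.cat(
            [lang, vis_feats, depth_feats, point_scores.unsqueeze(1)], dim=1
        )
        return self.fuse(x).squeeze(1)  # higher logit = likelier referent
```

At inference time, the highest-scoring candidate would be taken as the referent, e.g. `candidates[logits.argmax()]`.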
