Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

arXiv cs.CV / 3/26/2026


Key Points

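  • Most existing referring object detection (ROD) models are designed under data-rich assumptions, but label scarcity can become severe in settings such as robotics and AR, so the authors identify improving data efficiency as an open challenge.
  • They propose Data-efficient Referring Object Detection (De-ROD), a benchmark protocol for measuring ROD performance in low-data and few-shot settings.
  • HeROD (Heuristic-inspired ROD) is a lightweight, model-agnostic framework that injects interpretable spatial and semantic reasoning priors, derived from the referring phrase, into three stages of a DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching.
  • On RefCOCO, RefCOCO+, and RefCOCOg under scarce-label conditions, HeROD consistently outperforms strong grounding baselines, with reported gains in training convergence and label efficiency.
  • Overall, the study concludes that incorporating simple, interpretable reasoning priors can be a practical and extensible route to data-efficient vision-language understanding.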
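
To make the low-data setting concrete, here is a minimal sketch of how a scarce-label training subset might be drawn. It is not the De-ROD protocol itself, whose exact splits are not described on this page. The function name `sample_low_data_subset`, the annotation fields (`image_id`, `phrase`, `box`), and the choice to sample per image are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_low_data_subset(annotations, fraction, seed=0):
    """Keep the annotations of a random fraction of images (hypothetical helper).

    annotations: list of dicts with at least an "image_id" key.
    fraction:    share of images to keep, in (0, 1].
    seed:        fixed seed so the same subset is drawn on every run.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Group annotations by image so all phrases for one image stay together.
    by_image = defaultdict(list)
    for ann in annotations:
        by_image[ann["image_id"]].append(ann)

    # Sort before shuffling so the result depends only on the seed.
    image_ids = sorted(by_image)
    rng = random.Random(seed)
    rng.shuffle(image_ids)

    # Always keep at least one image.
    keep = max(1, round(len(image_ids) * fraction))
    kept_ids = set(image_ids[:keep])
    return [ann for ann in annotations if ann["image_id"] in kept_ids]


if __name__ == "__main__":
    # Toy data: 10 images, 2 referring phrases each (made-up values).
    toy = [
        {"image_id": i, "phrase": f"object {j}", "box": (0, 0, 10, 10)}
        for i in range(10)
        for j in range(2)
    ]
    subset = sample_low_data_subset(toy, fraction=0.2, seed=0)
    print(len(subset), "annotations kept")
```

Fixing the seed keeps the subset reproducible, which matters when comparing methods that are each trained on only a small number of labels.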

Abstract

Most referring object detection (ROD) models, especially modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce the Data-efficient Referring Object Detection (De-ROD) task, a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward more data-efficient vision-language understanding.
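
The idea of a phrase-derived prior that biases proposal ranking can be illustrated with a small sketch. This is not HeROD's actual formulation, which the abstract does not specify. The keyword rule, the functions `spatial_prior` and `rank_proposals`, and the fusion weight are hypothetical choices made only to show the mechanism.

```python
def spatial_prior(phrase, box, image_w, image_h):
    """Score a box in [0, 1] by how well its center agrees with direction words.

    box is (x1, y1, x2, y2) in pixels. Returns 0.5 (neutral) when the phrase
    contains no recognized spatial cue. The keyword list is an assumption.
    """
    cx = (box[0] + box[2]) / 2 / image_w  # normalized center x
    cy = (box[1] + box[3]) / 2 / image_h  # normalized center y
    words = phrase.lower().split()

    scores = []
    if "left" in words:
        scores.append(1.0 - cx)
    if "right" in words:
        scores.append(cx)
    if "top" in words:
        scores.append(1.0 - cy)
    if "bottom" in words:
        scores.append(cy)
    return sum(scores) / len(scores) if scores else 0.5


def rank_proposals(phrase, boxes, model_scores, image_w, image_h, weight=0.3):
    """Fuse model confidence with the prior and return indices, best first."""
    fused = [
        (1.0 - weight) * score + weight * spatial_prior(phrase, box, image_w, image_h)
        for box, score in zip(boxes, model_scores)
    ]
    order = sorted(range(len(boxes)), key=lambda i: fused[i], reverse=True)
    return order, fused


if __name__ == "__main__":
    # Two candidate boxes in a 100x100 image: one far left, one far right.
    boxes = [(0, 40, 20, 60), (80, 40, 100, 60)]
    model_scores = [0.5, 0.6]  # the detector slightly prefers the right box

    order, _ = rank_proposals("the dog on the left", boxes, model_scores, 100, 100)
    print(order)  # the left box is ranked first once the prior is applied

    order, _ = rank_proposals("the dog", boxes, model_scores, 100, 100)
    print(order)  # with no spatial cue the model score alone decides
```

In this sketch the prior only reorders candidates at inference time. A similar term could in principle be subtracted from a matching cost during training so that plausible queries are preferred when assigning predictions to ground truth, which is one way to read the "Hungarian matching" stage named in the abstract.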
