Memory-Guided View Refinement for Dynamic Human-in-the-Loop EQA

arXiv cs.CV / March 11, 2026


Key Points

  • Embodied Question Answering (EQA) faces challenges in dynamic, human-populated environments, where transient, view-dependent cues and occlusions cause perceptual non-stationarity.
  • The paper introduces DynHiL-EQA, a new human-in-the-loop EQA dataset with Dynamic and Static subsets, enabling systematic evaluation under temporal change and human activity.
  • The authors propose DIVRR, a training-free framework that couples relevance-guided view refinement with selective memory admission, improving robustness to occlusion while keeping inference efficient with a compact memory (a sketch of the admission step follows this list).
  • Experiments on the DynHiL-EQA and HM-EQA datasets show that DIVRR outperforms existing methods in both dynamic and static scenarios while preserving inference speed and memory efficiency.
  • The work addresses key practical challenges for EQA agents, including resolving viewpoint ambiguity and efficiently managing task-relevant visual evidence in changing environments.
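To make the admission step concrete, here is a minimal sketch of selective memory admission under a relevance threshold. The paper does not publish code, so MemoryBank, the (score, observation) layout, the capacity, and the threshold tau are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of selective memory admission (not the authors' code):
# admit an observation only when its question-relevance clears a threshold,
# and keep the memory compact by evicting the least relevant entry when full.
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    capacity: int = 32                           # compact memory budget (assumed)
    entries: list = field(default_factory=list)  # list of (score, observation)

    def admit(self, observation, score, tau=0.6):
        """Return True if the observation was admitted to memory."""
        if score < tau:                          # reject redundant/uninformative evidence
            return False
        self.entries.append((score, observation))
        if len(self.entries) > self.capacity:    # enforce the memory budget
            self.entries.sort(key=lambda e: e[0])
            self.entries.pop(0)                  # evict the least relevant entry
        return True
```

In such a scheme, an agent would call memory.admit(frame, relevance(frame, question)) after each step; only evidence above the threshold is stored, which is one way to keep a store-then-retrieve memory from over-accumulating redundant observations.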


arXiv:2603.09541 (cs)
[Submitted on 10 Mar 2026]

Title: Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA

Authors: Xin Lu and 8 other authors
Abstract: Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency.
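The "verify before committing" behavior described in the abstract can be pictured as a small control loop: when an observation's relevance to the question falls in an ambiguous band (for instance because of occlusion), the agent refines its view before deciding whether to admit it. The following is a hedged sketch only; relevance, refine_view, and the thresholds are assumed names, not the paper's actual interface:

```python
# Hypothetical control loop for relevance-guided view refinement (assumed
# names and thresholds; the paper does not specify this interface).
LOW, HIGH = 0.3, 0.7          # assumed "ambiguous" relevance band

def process_observation(obs, question, memory, relevance, refine_view,
                        max_refinements=3):
    """Verify ambiguous observations with refined views before admitting them."""
    score = relevance(obs, question)
    for _ in range(max_refinements):
        if not (LOW <= score <= HIGH):   # unambiguous either way: stop refining
            break
        obs = refine_view(obs)           # e.g., move/zoom toward the occluded cue
        score = relevance(obs, question)
    if score > HIGH:                     # commit only informative evidence
        memory.admit(obs, score)
    return score
```

The design point is that refinement is triggered by ambiguity rather than applied everywhere, so clearly relevant or clearly irrelevant observations cost no extra views.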
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as: arXiv:2603.09541 [cs.CV]
  (or arXiv:2603.09541v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2603.09541

Submission history

From: Xin Lu
[v1] Tue, 10 Mar 2026 11:51:54 UTC (2,359 KB)