Memory-Guided View Refinement for Dynamic Human-in-the-Loop EQA

arXiv cs.CV / March 11, 2026


Key Points

  • Embodied Question Answering (EQA) faces challenges in dynamic, human-populated environments, where transient, view-dependent cues and occlusions cause perceptual non-stationarity.
  • The paper introduces DynHiL-EQA, a new human-in-the-loop EQA dataset with Dynamic and Static subsets, enabling systematic evaluation under temporal change and human activity.
  • The authors propose DIVRR, a training-free framework that couples relevance-guided view refinement with selective memory admission, improving robustness to occlusion while keeping inference efficient with a compact memory (a sketch of the admission step follows this list).
  • Experiments on the DynHiL-EQA and HM-EQA datasets show that DIVRR outperforms existing methods in both dynamic and static scenarios while preserving inference speed and memory efficiency.
  • The work addresses key practical challenges for EQA agents, including resolving viewpoint ambiguity and efficiently managing task-relevant visual evidence in changing environments.
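To make the admission step concrete, here is a minimal sketch of selective memory admission under a relevance threshold. The paper does not publish code, so MemoryBank, the (score, observation) layout, the capacity, and the threshold tau are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of selective memory admission (not the authors' code):
# admit an observation only when its question-relevance clears a threshold,
# and keep the memory compact by evicting the least relevant entry when full.
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    capacity: int = 32                           # compact memory budget (assumed)
    entries: list = field(default_factory=list)  # list of (score, observation)

    def admit(self, observation, score, tau=0.6):
        """Return True if the observation was admitted to memory."""
        if score < tau:                          # reject redundant/uninformative evidence
            return False
        self.entries.append((score, observation))
        if len(self.entries) > self.capacity:    # enforce the memory budget
            self.entries.sort(key=lambda e: e[0])
            self.entries.pop(0)                  # evict the least relevant entry
        return True
```

In such a scheme, an agent would call memory.admit(frame, relevance(frame, question)) after each step; only evidence above the threshold is stored, which is one way to keep a store-then-retrieve memory from over-accumulating redundant observations.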


arXiv:2603.09541 (cs)
[Submitted on 10 Mar 2026]

Title: Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA

Authors: Xin Lu and 8 other authors
Abstract: Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency.
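The "verify before committing" behavior described in the abstract can be pictured as a small control loop: when an observation's relevance to the question falls in an ambiguous band (for instance because of occlusion), the agent refines its view before deciding whether to admit it. The following is a hedged sketch only; relevance, refine_view, and the thresholds are assumed names, not the paper's actual interface:

```python
# Hypothetical control loop for relevance-guided view refinement (assumed
# names and thresholds; the paper does not specify this interface).
LOW, HIGH = 0.3, 0.7          # assumed "ambiguous" relevance band

def process_observation(obs, question, memory, relevance, refine_view,
                        max_refinements=3):
    """Verify ambiguous observations with refined views before admitting them."""
    score = relevance(obs, question)
    for _ in range(max_refinements):
        if not (LOW <= score <= HIGH):   # unambiguous either way: stop refining
            break
        obs = refine_view(obs)           # e.g., move/zoom toward the occluded cue
        score = relevance(obs, question)
    if score > HIGH:                     # commit only informative evidence
        memory.admit(obs, score)
    return score
```

The design point is that refinement is triggered by ambiguity rather than applied everywhere, so clearly relevant or clearly irrelevant observations cost no extra views.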
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as: arXiv:2603.09541 [cs.CV]
  (or arXiv:2603.09541v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2603.09541

Submission history

From: Xin Lu
[v1] Tue, 10 Mar 2026 11:51:54 UTC (2,359 KB)