Ego-Grounding for Personalized Question-Answering in Egocentric Videos

arXiv cs.CV / 4/3/2026


Key Points

  • The paper provides a first systematic evaluation of multimodal LLMs for personalized question answering in egocentric (camera-wearer) videos, focusing on the ability to perform “ego-grounding.”
  • It introduces MyEgo, a new egocentric VideoQA dataset with 541 long videos and about 5K personalized questions about “my things,” “my activities,” and “my past,” benchmarked across multiple MLLM variants.
  • Results show that even top closed- and open-source MLLMs (e.g., GPT-5 and Qwen3-VL) perform poorly on MyEgo, reaching only ~46% (closed) and ~36% (open) accuracy and falling far behind human performance.
  • The study finds that neither explicit reasoning nor larger model scale consistently improves performance; providing relevant evidence helps, but the gains diminish over longer time spans, suggesting weaknesses in tracking the “me” identity and in long-range memory of past context.
  • The authors conclude that ego-grounding and long-range memory are key missing capabilities for personalized egocentric assistance and release the data/code to spur further research.

Abstract

We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question answering requiring ego-grounding: the ability to understand the camera wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants (open-source vs. proprietary, thinking vs. non-thinking, small vs. large scale) all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and ~36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but the gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo.