Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos

arXiv cs.CV / 4/7/2026


Key Points

  • The paper proposes a Vision Large Language Model (VLLM) approach to anticipate human-object interactions from egocentric (first-person) video, targeting assistive systems that need both short- and long-term intent understanding.
  • It improves visual grounding using a Set-of-Mark prompting strategy and infers user intent from the trajectory formed by recent gaze fixations.
  • To capture the temporal dynamics right before interactions, the authors introduce an inverse exponential sampling strategy for selecting input video frames (a minimal sketch follows this list).
  • Experiments on the HD-EPIC egocentric dataset show performance gains over state-of-the-art methods and highlight the model-agnostic nature of the approach.
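
The exact sampling schedule is not given in this summary, so the sketch below shows one plausible reading: offsets measured backwards from the most recent frame grow exponentially, so the clip is sampled densely right before the anticipated interaction and sparsely further back. The `base` hyperparameter and the index arithmetic are assumptions, not the paper's specification.

```python
import numpy as np

def inverse_exponential_sample(num_frames: int, k: int, base: float = 2.0):
    """Select up to k frame indices from a clip of num_frames frames,
    with sampling density increasing toward the most recent frame.

    Backward offsets from the last frame grow exponentially
    (0, 1, 3, 7, ... for base=2), which is one plausible reading of
    the paper's "inverse exponential" schedule; base is an assumed knob.
    """
    # Exponentially growing backward offsets, capped at the clip length.
    offsets = np.unique(np.minimum(
        np.floor(base ** np.arange(k)).astype(int) - 1, num_frames - 1))
    # Convert offsets-from-the-end into forward frame indices.
    return sorted(int(num_frames - 1 - o) for o in offsets)

# From a 64-frame observation window, keep 6 frames, densest near the
# end (frame 63, immediately before the interaction onset).
print(inverse_exponential_sample(64, 6))  # -> [32, 48, 56, 60, 62, 63]
```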

Abstract

The ability to anticipate human-object interactions is highly desirable for an intelligent assistive system that must guide users through daily-life activities and understand their short- and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in egocentric vision using Vision Large Language Models (VLLMs). We tackle key limitations of existing approaches by improving visual grounding through Set-of-Mark prompting and by inferring user intent from the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches on the considered task while remaining model-agnostic.
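
As a concrete illustration of the two prompting ingredients above, the sketch below overlays numbered Set-of-Mark labels on detected objects and folds the recent gaze trajectory into the text prompt. The function name, input formats, and prompt wording are illustrative assumptions; the paper's exact marking scheme and prompt template may differ.

```python
from PIL import Image, ImageDraw

def build_som_gaze_prompt(frame, boxes, gaze_trajectory):
    """Overlay numbered Set-of-Mark labels on detected objects and build
    a text prompt describing the user's recent gaze fixations.

    frame: PIL.Image of the current egocentric view.
    boxes: dict mapping mark id -> (x1, y1, x2, y2) object box.
    gaze_trajectory: list of (x, y) fixation points, oldest first.
    All names and the prompt wording are illustrative, not the paper's.
    """
    marked = frame.copy()
    draw = ImageDraw.Draw(marked)
    for mark_id, (x1, y1, x2, y2) in boxes.items():
        # Draw the box and its numeric mark so the VLLM can refer to it.
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(mark_id), fill="red")
    gaze_str = " -> ".join(f"({x}, {y})" for x, y in gaze_trajectory)
    prompt = (
        "The image contains numbered object marks. "
        f"The user's most recent gaze fixations were: {gaze_str}. "
        "Which marked object will the user interact with next, and "
        "with what action?"
    )
    return marked, prompt  # feed both to a VLLM of choice

# Example usage on a dummy frame with two detected objects.
frame = Image.new("RGB", (640, 480))
marked, prompt = build_som_gaze_prompt(
    frame,
    boxes={1: (50, 60, 180, 200), 2: (300, 100, 420, 260)},
    gaze_trajectory=[(120, 140), (250, 150), (340, 160)],
)
print(prompt)
```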