Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos

arXiv cs.CV / 4/7/2026


Key Points

  • The paper proposes a Vision Large Language Model (VLLM) approach to anticipate human-object interactions from egocentric (first-person) video, targeting assistive systems that need both short- and long-term intent understanding.
  • It improves visual grounding using a Set-of-Mark prompting strategy and infers user intent from the trajectory formed by recent gaze fixations.
  • To capture the temporal dynamics right before interactions, the authors introduce an inverse exponential sampling strategy for selecting input video frames (a minimal sketch follows this list).
  • Experiments on the HD-EPIC egocentric dataset show performance gains over state-of-the-art methods and highlight the model-agnostic nature of the approach.
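
The exact sampling schedule is not given in this summary, so the sketch below shows one plausible reading: offsets measured backwards from the most recent frame grow exponentially, so the clip is sampled densely right before the anticipated interaction and sparsely further back. The `base` hyperparameter and the index arithmetic are assumptions, not the paper's specification.

```python
import numpy as np

def inverse_exponential_sample(num_frames: int, k: int, base: float = 2.0):
    """Select up to k frame indices from a clip of num_frames frames,
    with sampling density increasing toward the most recent frame.

    Backward offsets from the last frame grow exponentially
    (0, 1, 3, 7, ... for base=2), which is one plausible reading of
    the paper's "inverse exponential" schedule; base is an assumed knob.
    """
    # Exponentially growing backward offsets, capped at the clip length.
    offsets = np.unique(np.minimum(
        np.floor(base ** np.arange(k)).astype(int) - 1, num_frames - 1))
    # Convert offsets-from-the-end into forward frame indices.
    return sorted(int(num_frames - 1 - o) for o in offsets)

# From a 64-frame observation window, keep 6 frames, densest near the
# end (frame 63, immediately before the interaction onset).
print(inverse_exponential_sample(64, 6))  # -> [32, 48, 56, 60, 62, 63]
```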

Abstract

The ability to anticipate human-object interactions is highly desirable for an intelligent assistive system that must guide users through daily-life activities and understand their short- and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in egocentric vision using Vision Large Language Models (VLLMs). We tackle key limitations of existing approaches by improving visual grounding through Set-of-Mark prompting and by inferring user intent from the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches on the considered task while remaining model-agnostic.
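
As a concrete illustration of the two prompting ingredients above, the sketch below overlays numbered Set-of-Mark labels on detected objects and folds the recent gaze trajectory into the text prompt. The function name, input formats, and prompt wording are illustrative assumptions; the paper's exact marking scheme and prompt template may differ.

```python
from PIL import Image, ImageDraw

def build_som_gaze_prompt(frame, boxes, gaze_trajectory):
    """Overlay numbered Set-of-Mark labels on detected objects and build
    a text prompt describing the user's recent gaze fixations.

    frame: PIL.Image of the current egocentric view.
    boxes: dict mapping mark id -> (x1, y1, x2, y2) object box.
    gaze_trajectory: list of (x, y) fixation points, oldest first.
    All names and the prompt wording are illustrative, not the paper's.
    """
    marked = frame.copy()
    draw = ImageDraw.Draw(marked)
    for mark_id, (x1, y1, x2, y2) in boxes.items():
        # Draw the box and its numeric mark so the VLLM can refer to it.
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(mark_id), fill="red")
    gaze_str = " -> ".join(f"({x}, {y})" for x, y in gaze_trajectory)
    prompt = (
        "The image contains numbered object marks. "
        f"The user's most recent gaze fixations were: {gaze_str}. "
        "Which marked object will the user interact with next, and "
        "with what action?"
    )
    return marked, prompt  # feed both to a VLLM of choice

# Example usage on a dummy frame with two detected objects.
frame = Image.new("RGB", (640, 480))
marked, prompt = build_som_gaze_prompt(
    frame,
    boxes={1: (50, 60, 180, 200), 2: (300, 100, 420, 260)},
    gaze_trajectory=[(120, 140), (250, 150), (340, 160)],
)
print(prompt)
```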