Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos
arXiv cs.CV / 4/7/2026
Key Points
- The paper proposes a Vision Large Language Model (VLLM) approach to anticipate human-object interactions from egocentric (first-person) video, targeting assistive systems that need both short- and long-term intent understanding.
- It improves visual grounding with a Set-of-Mark prompting strategy, which overlays numbered marks on candidate objects so the model can reference them directly, and infers user intent from the trajectory traced by the most recent gaze fixations (see the first sketch after this list).
- To capture the temporal dynamics right before an interaction, the authors introduce an inverse exponential sampling strategy that selects input video frames densely near the anticipated interaction and progressively more sparsely further back in time (see the second sketch below).
- Experiments on the HD-EPIC egocentric dataset show performance gains over state-of-the-art methods and highlight the model-agnostic nature of the approach.
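A minimal Python sketch of the two prompting ingredients, assuming a generic object detector supplies bounding boxes and an eye tracker supplies normalized fixations; the function names and prompt wording are illustrative, not the paper's exact implementation:

```python
from PIL import Image, ImageDraw

def draw_set_of_marks(frame: Image.Image,
                      boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Overlay numbered Set-of-Mark labels on detected object boxes so the
    VLLM can refer to each object by its index instead of raw pixels."""
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(idx), fill="red")
    return out

def gaze_trajectory_prompt(fixations: list[tuple[float, float]]) -> str:
    """Serialize recent gaze fixations (normalized x, y, oldest first) into
    a textual intent cue to pair with the marked frame."""
    points = " -> ".join(f"({x:.2f}, {y:.2f})" for x, y in fixations)
    return (f"Recent gaze fixations, oldest to newest: {points}. "
            "Given the numbered objects, which object will the user "
            "interact with next, and through which action?")
```

The marked frames and the gaze prompt are then passed together to the VLLM, which is consistent with the model-agnostic claim: any model that accepts image-plus-text input can consume them.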
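For the frame-selection step, here is a hedged sketch of an inverse exponential schedule, assuming the goal is to sample densely just before the interaction; the `rate` knob and the exact warping formula are assumptions, as the summary does not give the paper's precise equation:

```python
import numpy as np

def inverse_exponential_indices(num_frames: int, num_samples: int,
                                rate: float = 4.0) -> list[int]:
    """Pick num_samples frame indices from a clip of num_frames frames,
    clustering them near the end of the clip (just before the interaction)
    with exponentially growing gaps further back in time."""
    u = np.linspace(0.0, 1.0, num_samples)
    # Offset from the clip end decays as exp(-rate * u), normalized so the
    # first sample lands on frame 0 and the last on the final frame.
    offsets = (np.exp(-rate * u) - np.exp(-rate)) / (1.0 - np.exp(-rate))
    indices = np.round((num_frames - 1) * (1.0 - offsets)).astype(int)
    return sorted(set(indices.tolist()))
```

For example, `inverse_exponential_indices(300, 8)` returns `[0, 133, 207, 250, 274, 287, 295, 299]`: the gaps shrink toward the clip's end, so most of the temporal budget is spent on the moments immediately preceding the interaction.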