GazeVLA: Learning Human Intention for Robotic Manipulation
arXiv cs.RO · April 27, 2026
News · Models & Research
Key Points
- GazeVLA proposes bridging the “embodiment gap” between humans and robots by using human intention as an intermediate representation for robotic manipulation.
- The method models intention from gaze, treating it as an observable signal that naturally precedes physical actions and can be transferred to robot behavior.
- GazeVLA is pretrained on a large-scale egocentric human dataset to learn intention and its relationship with actions, then fine-tuned with a small set of robot and human data.
- During inference, it uses a Chain-of-Thought-style process to predict intention sequentially before executing actions.
- Experiments in simulation and on real robots across long-horizon, fine-grained, few-shot, and robustness benchmarks show consistent gains over strong baselines and achieve state-of-the-art results.