Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

arXiv cs.RO / 4/6/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that embodied AI needs active visual perception, where a robot actively chooses where and at what zoom level to look to maximize task-relevant information within sensing constraints.
  • It introduces a language-guided active perception task: given one RGB image and an instruction, the agent must predict PTZ (pan/tilt/zoom) camera adjustments to capture the most informative view for the task.
  • The authors propose EyeVLA, an autoregressive vision-language-action framework that unifies visual perception, language understanding, and physical camera control in a single model.
  • EyeVLA uses a hierarchical action encoding that discretizes continuous camera movements into compact tokens within the VLM’s token space, enabling joint multimodal reasoning over both perception and actions.
  • Using pseudo-labeling, iterative IoU-controlled data refinement, and reinforcement learning with GRPO, the method transfers from a pretrained VLM using only 500 real-world samples and reports a 96% average task completion rate across 50 scenes.
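To make the action-encoding idea concrete, here is a minimal sketch of how continuous PTZ commands can be binned into discrete tokens appended to a VLM vocabulary. The bin count, axis ranges, and token naming below are illustrative assumptions, not the paper's actual parameters.

```python
# Illustrative PTZ action tokenization (assumed ranges and resolution).
PAN_RANGE = (-170.0, 170.0)   # degrees, assumed
TILT_RANGE = (-30.0, 90.0)    # degrees, assumed
ZOOM_RANGE = (1.0, 20.0)      # optical zoom factor, assumed
NUM_BINS = 256                # assumed bins per axis

def to_bin(value, lo, hi, n=NUM_BINS):
    """Clamp a continuous value to [lo, hi] and map it to a discrete bin index."""
    value = max(lo, min(hi, value))
    return round((value - lo) / (hi - lo) * (n - 1))

def encode_action(pan, tilt, zoom):
    """Tokenize a (pan, tilt, zoom) command as special tokens in the VLM vocabulary."""
    return [
        f"<pan_{to_bin(pan, *PAN_RANGE)}>",
        f"<tilt_{to_bin(tilt, *TILT_RANGE)}>",
        f"<zoom_{to_bin(zoom, *ZOOM_RANGE)}>",
    ]

def decode_token(token, lo, hi, n=NUM_BINS):
    """Map a token like '<zoom_12>' back to the continuous value at that bin."""
    idx = int(token.split("_")[1].rstrip(">"))
    return lo + idx / (n - 1) * (hi - lo)
```

Because each action becomes an ordinary token sequence, the model can emit camera adjustments autoregressively alongside language, which is what lets a single VLM handle both perception and control.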

Abstract

In embodied AI, visual perception should be active rather than passive: the system must decide where to look and at what scale to sense to acquire maximally informative data under pixel and spatial budget constraints. Existing vision models coupled with fixed RGB-D cameras fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. We study the task of language-guided active visual perception: given a single RGB image and a natural language instruction, the agent must output pan, tilt, and zoom adjustments of a real PTZ (pan-tilt-zoom) camera to acquire the most informative view for the specified task. We propose EyeVLA, a unified framework that addresses this task by integrating visual perception, language understanding, and physical camera control within a single autoregressive vision-language-action model. EyeVLA introduces a semantically rich and efficient hierarchical action encoding that compactly tokenizes continuous camera adjustments and embeds them into the VLM vocabulary for joint multimodal reasoning. Through a data-efficient pipeline comprising pseudo-label generation, iterative IoU-controlled data refinement, and reinforcement learning with Group Relative Policy Optimization (GRPO), we transfer the open-world understanding of a pre-trained VLM to an embodied active perception policy using only 500 real-world samples. Evaluations on 50 diverse real-world scenes across five independent evaluation runs demonstrate that EyeVLA achieves an average task completion rate of 96%. Our work establishes a new paradigm for instruction-driven active visual information acquisition in multimodal embodied systems.
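The RL stage described above relies on GRPO's core trick: instead of a learned value baseline, each sampled action sequence is scored and normalized against the other samples in its group. A minimal sketch of that group-relative advantage computation, with the reward signal (e.g., IoU between the captured view and the target region) left as an assumption:

```python
# Sketch of GRPO-style group-relative advantages: sample a group of
# candidate camera actions per instruction, score each with a task reward,
# and normalize rewards within the group. The reward itself (here, a
# placeholder list) would come from something like view/target IoU.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-sample rewards against the group mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Samples that beat the group average get positive advantages and are
# reinforced; below-average samples get negative advantages.
```

This removes the need for a separate critic network, which is part of why the pipeline can work from a small amount of real-world data.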