Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

arXiv cs.CV / 4/24/2026


Key Points

  • The paper addresses a key bottleneck for vision-language models: high-resolution inputs greatly increase visual-token counts and therefore compute overhead.
  • It proposes “Foveated Reasoner,” an autoregressive vision-language framework that performs foveation and reasoning within a single decoding trajectory by starting from low resolution and selectively requesting high-resolution evidence.
  • The model decides when to foveate, retrieves high-acuity information from the chosen regions, and injects that evidence back into the same ongoing generation process (see the sketch after this list).
  • Training uses a two-stage approach: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve region selection and task accuracy while avoiding trivial strategies that “see everything.”
  • Experiments across multiple vision-language benchmarks show improved accuracy under strict visual-token budgets and evidence that the learned foveation policies are effective.
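
To make the single-trajectory idea concrete, here is a minimal sketch of what a foveated decoding loop could look like. The names (`model.next`, `encode_region`, the `foveate` action and its region format) are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of a single-trajectory foveated decoding loop.
# `model.next`, `encode_region`, and the "foveate" action format are
# assumptions for illustration, not the paper's actual API.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:
    tokens: List[str] = field(default_factory=list)  # mixed text + visual tokens
    visual_token_count: int = 0

def foveated_decode(model, low_res_tokens, encode_region, budget, max_steps=256):
    """Decode text and foveation actions in one autoregressive trajectory."""
    traj = Trajectory(tokens=list(low_res_tokens),
                      visual_token_count=len(low_res_tokens))
    for _ in range(max_steps):
        step = model.next(traj.tokens)                # next token or action
        if step.kind == "foveate" and traj.visual_token_count < budget:
            # Retrieve high-acuity evidence for the requested region and
            # inject it back into the same decoding trajectory.
            hi_res_tokens = encode_region(step.region)  # e.g. (x0, y0, x1, y1)
            traj.tokens.extend(hi_res_tokens)
            traj.visual_token_count += len(hi_res_tokens)
        elif step.kind == "text":
            traj.tokens.append(step.token)
            if step.token == "<eos>":
                break
    return traj
```

The key design point is that the high-resolution tokens are appended to the same sequence the model is already decoding, so later reasoning tokens can attend to the newly acquired evidence without restarting generation.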

Abstract

Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
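
One way to discourage the trivial "see-everything" policy during the RL stage is to fold a visual-token cost into the reward. The sketch below is an assumption about how such a reward could be shaped; the weighting and the budget term are illustrative, not the paper's exact objective.

```python
# Hypothetical sketch of an RL reward that trades task accuracy against the
# number of high-resolution visual tokens requested. The cost_weight and the
# budget-overuse term are illustrative assumptions, not the paper's objective.

def foveation_reward(correct: bool, visual_tokens_used: int,
                     budget: int, cost_weight: float = 0.1) -> float:
    """Reward correct answers; penalize exceeding the visual-token budget."""
    accuracy_term = 1.0 if correct else 0.0
    # The cost term grows once the policy starts "seeing everything",
    # discouraging the trivial strategy of always requesting full resolution.
    overuse = max(0, visual_tokens_used - budget) / max(budget, 1)
    return accuracy_term - cost_weight * overuse
```

Under a reward of this shape, a policy is pushed to foveate only on regions that actually change the answer, which matches the paper's goal of strong accuracy under tight visual-token budgets.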