Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

arXiv cs.CV / 4/24/2026


Key Points

  • The paper addresses a key bottleneck for vision-language models: high-resolution inputs greatly increase visual-token counts and therefore compute overhead.
  • It proposes “Foveated Reasoner,” an autoregressive vision-language framework that performs foveation and reasoning within a single decoding trajectory by starting from low resolution and selectively requesting high-resolution evidence.
  • The model decides when to foveate, retrieves high-acuity information from the chosen regions, and injects that evidence back into the same ongoing generation process (see the sketch after this list).
  • Training uses a two-stage approach: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve region selection and task accuracy while avoiding trivial strategies that “see everything.”
  • Experiments across multiple vision-language benchmarks show improved accuracy under strict visual-token budgets and evidence that the learned foveation policies are effective.
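
To make the single-trajectory idea concrete, here is a minimal sketch of what a foveated decoding loop could look like. The names (`model.next`, `encode_region`, the `foveate` action and its region format) are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of a single-trajectory foveated decoding loop.
# `model.next`, `encode_region`, and the "foveate" action format are
# assumptions for illustration, not the paper's actual API.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:
    tokens: List[str] = field(default_factory=list)  # mixed text + visual tokens
    visual_token_count: int = 0

def foveated_decode(model, low_res_tokens, encode_region, budget, max_steps=256):
    """Decode text and foveation actions in one autoregressive trajectory."""
    traj = Trajectory(tokens=list(low_res_tokens),
                      visual_token_count=len(low_res_tokens))
    for _ in range(max_steps):
        step = model.next(traj.tokens)                # next token or action
        if step.kind == "foveate" and traj.visual_token_count < budget:
            # Retrieve high-acuity evidence for the requested region and
            # inject it back into the same decoding trajectory.
            hi_res_tokens = encode_region(step.region)  # e.g. (x0, y0, x1, y1)
            traj.tokens.extend(hi_res_tokens)
            traj.visual_token_count += len(hi_res_tokens)
        elif step.kind == "text":
            traj.tokens.append(step.token)
            if step.token == "<eos>":
                break
    return traj
```

The key design point is that the high-resolution tokens are appended to the same sequence the model is already decoding, so later reasoning tokens can attend to the newly acquired evidence without restarting generation.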

Abstract

Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
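
One way to discourage the trivial "see-everything" policy during the RL stage is to fold a visual-token cost into the reward. The sketch below is an assumption about how such a reward could be shaped; the weighting and the budget term are illustrative, not the paper's exact objective.

```python
# Hypothetical sketch of an RL reward that trades task accuracy against the
# number of high-resolution visual tokens requested. The cost_weight and the
# budget-overuse term are illustrative assumptions, not the paper's objective.

def foveation_reward(correct: bool, visual_tokens_used: int,
                     budget: int, cost_weight: float = 0.1) -> float:
    """Reward correct answers; penalize exceeding the visual-token budget."""
    accuracy_term = 1.0 if correct else 0.0
    # The cost term grows once the policy starts "seeing everything",
    # discouraging the trivial strategy of always requesting full resolution.
    overuse = max(0, visual_tokens_used - budget) / max(budget, 1)
    return accuracy_term - cost_weight * overuse
```

Under a reward of this shape, a policy is pushed to foveate only on regions that actually change the answer, which matches the paper's goal of strong accuracy under tight visual-token budgets.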