Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
arXiv cs.CV / 3/19/2026
Key Points
- AwaRes is a spatial-on-demand framework for Vision-Language Models: it operates on a low-resolution global view and selectively retrieves high-resolution crops only where a query needs them, achieving high accuracy while remaining efficient.
- A judge automatically decides whether cropping is required by comparing answers produced from the low- and high-resolution views; an oracle grounding model localizes the supporting evidence, and the result is mapped to a discrete crop set to form multi-turn tool-use trajectories.
- Training combines cold-start supervised fine-tuning (SFT) followed by multi-turn GRPO with a composite reward that penalizes crop costs while rewarding semantic correctness.
- The method aims to preserve small but important details (such as text) while reducing computational cost; a project page is provided.
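The pipeline above can be sketched in miniature. The snippet below is a hedged illustration, not the paper's implementation: the 3x3 grid size, the box-to-cell mapping, and the `lam` penalty weight are all assumptions standing in for the paper's actual discrete crop set and composite reward.

```python
# Hypothetical sketch of two pieces of the AwaRes-style pipeline:
# (1) mapping a grounded evidence box onto a discrete crop set, and
# (2) a composite reward that trades semantic correctness against crop cost.

GRID = 3  # assumed: the image is partitioned into a 3x3 discrete crop set


def bbox_to_crop_index(bbox, img_w, img_h, grid=GRID):
    """Map an evidence box (x0, y0, x1, y1) from the grounding model
    to the index of the grid cell containing the box center."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row * grid + col


def composite_reward(correct: bool, num_crops: int, lam: float = 0.1) -> float:
    """Reward semantic correctness while penalizing each retrieved crop,
    so the policy learns to crop only when the query requires it."""
    return (1.0 if correct else 0.0) - lam * num_crops
```

For example, an evidence box in the top-left of a 100x100 image maps to cell 0, and a correct answer that used two crops earns a reward of 0.8 under the assumed penalty weight.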