VGR: Visual Grounded Reasoning
arXiv cs.CV / 5/4/2026
📰 News · Models & Research
Key Points
- VGR is a new multimodal LLM for visual grounded reasoning that moves beyond language-only chain-of-thought approaches, which suffer from language bias and limited visual reasoning capability.
- Instead of answering purely from the language space, VGR first detects relevant image regions via bounding boxes, then produces answers using a replay mechanism that re-integrates those visual regions into the reasoning flow (sketched in code after this list).
- The authors build a large-scale SFT dataset, VGR-SFT, that mixes vision grounding with language deduction, training the model for fine-grained visual understanding.
- Experiments on a LLaVA-NeXT-7B baseline show VGR outperforming the baseline on multimodal benchmarks that require detailed image comprehension, while using only about 30% of the baseline's image token count.
- Reported gains include +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA relative to the baseline.
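Taken together, the bullets describe a two-stage detect-then-replay loop: ground the question in image regions, then reason over only those regions' tokens. The sketch below is a toy outline of that flow under our own assumptions, not the paper's implementation: `detect_regions`, `replay_tokens`, and `answer` are hypothetical stand-ins for the grounding step, the replay mechanism, and the final deduction, and the stubs exist only to show how replaying region tokens shrinks the visual context the reasoner sees.

```python
"""Illustrative sketch of a detect-then-replay reasoning loop.

All function and field names here are hypothetical stand-ins, not
VGR's actual API; the detector and replay steps are toy stubs.
"""
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) bounding box, hypothetical format
    tokens: List[str]               # visual tokens encoded from this region


def detect_regions(image_tokens: List[str]) -> List[Region]:
    """Stand-in for the grounding step: select question-relevant
    image regions instead of keeping the full image."""
    # Toy heuristic: pretend the first few tokens form one relevant region.
    return [Region(box=(0, 0, 64, 64), tokens=image_tokens[:4])]


def replay_tokens(regions: List[Region]) -> List[str]:
    """Stand-in for the replay mechanism: re-inject the grounded
    regions' visual tokens into the reasoning context."""
    return [tok for region in regions for tok in region.tokens]


def answer(question: str, image_tokens: List[str]) -> str:
    regions = detect_regions(image_tokens)    # 1) ground: find bounding boxes
    visual_context = replay_tokens(regions)   # 2) replay: reuse region tokens
    # 3) deduce: a real system would feed question + visual_context to the LLM;
    #    here we only report how few tokens reach the reasoning step.
    return (f"answering {question!r} with {len(visual_context)} region tokens "
            f"instead of {len(image_tokens)} full-image tokens")


if __name__ == "__main__":
    toy_image = [f"<img_{i}>" for i in range(16)]  # pretend encoder output
    print(answer("What does the chart's y-axis show?", toy_image))
```

Under this toy setup the reasoner sees 4 of 16 image tokens, which is the intuition behind the reported ~30% token budget: selective replay of grounded regions stands in for attending to every image token.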
Related Articles
Building a new enterprise AI services company with Blackstone, Hellman & Friedman, and Goldman Sachs
Anthropic News

Dara Khosrowshahi on replacing Uber drivers — and himself — with AI
The Verge

CLMA Frame Test
Dev.to

Governance and Liability in AI Agents: What I Built Trying to Answer Those Questions
Dev.to

Roundtable chat with Talkie-1930 and Gemma 4 31B
Reddit r/LocalLLaMA