VGR: Visual Grounded Reasoning

arXiv cs.CV / 5/4/2026


Key Points

  • VGR is a new multimodal LLM for visual grounded reasoning that moves beyond language-only chain-of-thought approaches, addressing their language bias and limited visual reasoning capability.
  • Instead of answering purely in language space, VGR first detects relevant image regions via bounding boxes and then produces answers using a replay mechanism that re-integrates those regions into the reasoning flow (see the sketch after this list).
  • The paper builds a large-scale SFT dataset (VGR-SFT) whose reasoning traces mix vision grounding with language deduction, used to train the model for fine-grained visual understanding.
  • Experiments on a LLaVA-NeXT-7B baseline show that VGR outperforms the baseline on multimodal benchmarks requiring detailed image comprehension, while using only about 30% of the baseline's image tokens.
  • Reported gains include +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA relative to the baseline.
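
The replay mechanism can be pictured as a two-pass loop: the model emits bounding boxes for regions it wants to inspect, the corresponding crops are re-encoded into visual tokens, and decoding continues with those tokens appended to the context. The sketch below is a minimal illustration of that flow under these assumptions, not the paper's implementation; `model.generate`, `model.encode_regions`, and the loop structure are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates

@dataclass
class Step:
    text: str          # partial reasoning text produced in this round
    boxes: List[Box]   # regions the model asked to "replay"
    done: bool         # True once a final answer has been produced

def grounded_reasoning(model, image, question: str, max_rounds: int = 4) -> str:
    """Hypothetical visual grounded reasoning loop.

    `model.generate` is assumed to return a Step: either a final answer,
    or bounding boxes for image regions it wants to re-inspect.
    `model.encode_regions` is assumed to turn the cropped regions back into
    visual tokens that are appended to the context (the "replay" stage).
    """
    context = [question]
    for _ in range(max_rounds):
        step = model.generate(image, context)
        context.append(step.text)
        if step.done or not step.boxes:
            break
        # Replay stage: crop the requested regions and feed them back in,
        # so later reasoning attends to fine-grained image detail.
        crops = [image.crop(box) for box in step.boxes]   # PIL-style crop
        context.append(model.encode_regions(crops))       # replayed visual tokens
    return context[-1]
```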

Abstract

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches rely predominantly on reasoning in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in language space, our VGR first detects relevant regions that may help solve the problem, and then provides precise answers based on replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to select bounding boxes for visual reference, and a replay stage is introduced to integrate the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive understanding of image details. Compared to the baseline, VGR uses only 30% of the image tokens while delivering improvements of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
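
For concreteness, a training example in such a mixed grounding-and-deduction dataset could be stored roughly as below. The field names and inline `<box>` tag format are illustrative assumptions, not the actual VGR-SFT schema.

```python
import json

# Hypothetical layout of one SFT example mixing language deduction with
# vision grounding: bounding boxes are embedded inline in the reasoning
# trace, marking where the model should "replay" a region of the image.
sample = {
    "image": "charts/0001.png",
    "question": "Which quarter had the highest revenue?",
    "reasoning": (
        "The legend sits in the top-right corner <box>(812, 40, 1004, 118)</box>. "
        "The tallest bar is the third one <box>(540, 210, 620, 690)</box>, "
        "which the x-axis labels as Q3."
    ),
    "answer": "Q3",
}

print(json.dumps(sample, indent=2))
```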