VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

arXiv cs.CV / April 24, 2026


Key Points

  • The paper introduces VG-CoT, a dataset designed to make large vision-language models’ reasoning trustworthy by explicitly grounding each step in verifiable image evidence.
  • It addresses scalability issues in prior datasets by using a fully automated three-stage pipeline that extracts object/text evidence (detection + OCR), generates grounded step-by-step reasoning with GPT-4o, and refines grounding with a rationale-driven open-set detection process.
  • A new benchmark is proposed to evaluate LVLMs along three dimensions: rationale quality, answer accuracy, and reasoning–answer alignment.
  • Experiments with representative models such as LLaVA-1.5 and Qwen2-VL show consistent improvements on most metrics, suggesting VG-CoT strengthens evidence-based reasoning while keeping dataset construction scalable and cost-efficient.
  • The authors plan to release the dataset and code publicly after acceptance to support further research.
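The three-stage pipeline described above can be sketched in miniature. All function bodies below are illustrative stubs under assumed data shapes (`Evidence`, `GroundedStep` and the example boxes are inventions for this sketch, not the paper's actual interfaces); in the real pipeline, stage 1 would call detection and OCR models, stage 2 would prompt GPT-4o, and stage 3 would re-verify citations with an open-set detector.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    label: str    # object class name or recognized text
    bbox: tuple   # (x1, y1, x2, y2) in image coordinates
    source: str   # "detection" or "ocr"

@dataclass
class GroundedStep:
    rationale: str
    cited: list   # Evidence items this reasoning step points at

def extract_evidence(image_id):
    """Stage 1 (stub): detection + OCR would produce region-level evidence."""
    return [Evidence("stop sign", (40, 30, 120, 110), "detection"),
            Evidence("STOP", (55, 50, 105, 80), "ocr")]

def generate_reasoning(question, evidence):
    """Stage 2 (stub): an LLM would emit steps, each citing evidence items."""
    return [GroundedStep(f"The sign reads '{evidence[1].label}'.", [evidence[1]]),
            GroundedStep("Therefore the driver must halt.", [evidence[0]])]

def refine_grounding(steps, evidence):
    """Stage 3 (stub): drop steps whose cited boxes no detector can verify."""
    verified = {e.bbox for e in evidence}
    return [s for s in steps if all(e.bbox in verified for e in s.cited)]

evidence = extract_evidence("img_001")
steps = refine_grounding(
    generate_reasoning("What must the driver do?", evidence), evidence)
```

The point of the structure is that every retained step carries explicit, checkable region citations rather than free-floating text.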

Abstract

The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLMs' reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
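The three benchmark dimensions can be made concrete with a toy scorer. The scoring rules below (region-overlap for rationale quality, exact match for accuracy, answer-mention for alignment) are assumptions for illustration only; the paper's actual metrics are not specified in this summary.

```python
def score_response(rationale_steps, predicted, gold, cited_regions, gold_regions):
    """Score one model response along three illustrative axes.

    rationale_steps: list of rationale strings, one per reasoning step
    cited_regions:   per-step lists of (x1, y1, x2, y2) boxes the step cites
    gold_regions:    set of annotated evidence boxes for this question
    """
    # Rationale Quality (toy rule): fraction of steps citing a gold region.
    quality = sum(any(r in gold_regions for r in step)
                  for step in cited_regions) / len(cited_regions)
    # Answer Accuracy (toy rule): case-insensitive exact match.
    accuracy = float(predicted.strip().lower() == gold.strip().lower())
    # Reasoning-Answer Alignment (toy rule): final step mentions the answer.
    alignment = float(gold.lower() in rationale_steps[-1].lower())
    return {"rationale_quality": quality,
            "answer_accuracy": accuracy,
            "alignment": alignment}

report = score_response(
    ["The sign reads 'STOP'.", "So the answer is stop."],
    predicted="Stop", gold="stop",
    cited_regions=[[(55, 50, 105, 80)], []],
    gold_regions={(55, 50, 105, 80)})
```

Separating the three scores makes the benchmark's point visible: a model can answer correctly (accuracy 1.0) while only half its steps are actually grounded.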