VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

Key Points

VisDoT formalizes four perceptual tasks based on graphical perception theory to better ground visual primitives, such as position and length, for chart understanding.

Abstract

Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
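
The perception-logic separation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `SubQuestion` type, the `build_dot_prompt` helper, and the prompt wording are all hypothetical, and the sub-questions are supplied by hand rather than generated by a model. The sketch only shows the ordering idea, with perception sub-questions listed before logic sub-questions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubQuestion:
    """Hypothetical container for one decomposed step."""
    kind: str  # "perception" (read something off the chart) or "logic" (reason over it)
    text: str


def build_dot_prompt(question: str, sub_questions: List[SubQuestion]) -> str:
    """Assemble a prompt that places perception steps before logic steps.

    The layout and wording are assumptions made for illustration only.
    """
    perception = [s for s in sub_questions if s.kind == "perception"]
    logic = [s for s in sub_questions if s.kind == "logic"]

    lines = [f"Question: {question}", "Perception steps:"]
    lines += [f"  P{i}. {s.text}" for i, s in enumerate(perception, 1)]
    lines.append("Logic steps:")
    lines += [f"  L{i}. {s.text}" for i, s in enumerate(logic, 1)]
    lines.append("Answer each step in order, then give the final answer.")
    return "\n".join(lines)


if __name__ == "__main__":
    steps = [
        SubQuestion("logic", "Which of the two values is larger?"),
        SubQuestion("perception", "What is the height of bar A?"),
        SubQuestion("perception", "What is the height of bar B?"),
    ]
    print(build_dot_prompt("Is bar A taller than bar B?", steps))
```

Keeping the two kinds of steps in separate, ordered groups means that any logic step can only refer to values the perception steps have already read from the chart, which is the separation the abstract credits for the reported gains.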
