VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
arXiv cs.AI / 3/17/2026
Key Points
- VTC-Bench is a comprehensive benchmark for evaluating tool-use proficiency in multimodal large language models (MLLMs). It comprises 32 OpenCV-based visual operations and 680 curated problems organized across a nine-category cognitive hierarchy.
- Experiments on 19 leading MLLMs show that current models struggle to adapt to diverse tool sets, generalize to unseen operations, and compose multiple tools for complex tasks; Gemini-3.0-Pro scores only 51% on the benchmark.
- The benchmark aligns with realistic computer vision pipelines and provides ground-truth execution trajectories to enable rigorous assessment of multi-tool composition and long-horizon planning.
- By identifying these limitations, VTC-Bench establishes a baseline to guide the development of more generalized visual agentic models.
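
To make the idea of compositional tool chaining and ground-truth trajectories concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than the benchmark's actual interface: the toy tools `invert`, `threshold`, and `count_nonzero` stand in for OpenCV-style operations, images are nested lists of intensities, and `trajectory_prefix_match` is a hypothetical scoring rule, not the metric VTC-Bench uses.

```python
from typing import Any, Callable, Dict, List, Tuple

# Toy grayscale image: rows of 0-255 intensities (a stand-in for a real image array).
Image = List[List[int]]
# One step of a tool chain: (tool name, keyword arguments). Hypothetical format.
Step = Tuple[str, Dict[str, Any]]


def invert(img: Image) -> Image:
    """Flip every intensity (255 - p)."""
    return [[255 - p for p in row] for row in img]


def threshold(img: Image, t: int = 128) -> Image:
    """Binarize: pixels >= t become 255, all others 0."""
    return [[255 if p >= t else 0 for p in row] for row in img]


def count_nonzero(img: Image) -> int:
    """Count pixels with a nonzero intensity."""
    return sum(1 for row in img for p in row if p)


# Hypothetical tool registry an agent could pick operations from.
TOOLS: Dict[str, Callable[..., Any]] = {
    "invert": invert,
    "threshold": threshold,
    "count_nonzero": count_nonzero,
}


def run_chain(img: Image, chain: List[Step]) -> Any:
    """Apply each tool in order, feeding every output into the next step."""
    value: Any = img
    for name, kwargs in chain:
        value = TOOLS[name](value, **kwargs)
    return value


def trajectory_prefix_match(predicted: List[Step], reference: List[Step]) -> float:
    """Fraction of reference steps whose tool names match the prediction,
    counted from the start and stopping at the first mismatch (assumed rule)."""
    matched = 0
    for (pred_name, _), (ref_name, _) in zip(predicted, reference):
        if pred_name != ref_name:
            break
        matched += 1
    return matched / len(reference)


IMG: Image = [[10, 200], [130, 220]]

# Reference trajectory: count the pixels that are dark in the original image.
REFERENCE: List[Step] = [("invert", {}), ("threshold", {"t": 128}), ("count_nonzero", {})]
# A predicted trajectory that skips the inversion step.
PREDICTED: List[Step] = [("threshold", {"t": 128}), ("count_nonzero", {})]

print(run_chain(IMG, REFERENCE))                      # 1
print(run_chain(IMG, PREDICTED))                      # 3
print(trajectory_prefix_match(PREDICTED, REFERENCE))  # 0.0
```

In this toy case, omitting a single operation changes both the final answer and the trajectory score, which illustrates why comparing the executed tool sequence against a reference can reveal composition errors that a final-answer check alone might not explain.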