Visual Reasoning through Tool-supervised Reinforcement Learning
arXiv cs.CV / 4/23/2026
Key Points
- The paper studies how multimodal large language models can learn to use visual tools effectively to solve complex visual reasoning tasks.
- It introduces a new Tool-supervised Reinforcement Learning (ToolsRL) framework that provides direct supervision signals for tool use, making tool learning more effective.
- The approach uses simple, native, and interpretable visual tools (e.g., zoom, rotate, flip, and drawing points or lines), for which supervision data is relatively easy to collect.
- A two-stage reinforcement learning curriculum is proposed: the first stage trains tool-calling skills with tool-specific rewards, and the second stage trains for visual-reasoning accuracy while still allowing tool calls. Separating the stages reduces conflicts between the two optimization goals.
- Experiments indicate that the tool-supervised curriculum improves training efficiency and enables strong tool-use capabilities for complex visual reasoning.
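
To make the tool set and the two-stage curriculum described in the key points more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the toy image representation (nested lists), the function names (`zoom`, `flip`, `rotate90`, `tool_reward`, `accuracy_reward`, `curriculum_reward`), the `Rollout` structure, and the specific reward values are all hypothetical, since the summary does not give the actual reward design.

```python
from dataclasses import dataclass, field

# Toy "image": a list of rows, each a list of pixel values (assumption for illustration).

def zoom(img, top, left, height, width):
    """Crop a rectangular region; a real zoom tool would also upsample it."""
    return [row[left:left + width] for row in img[top:top + height]]

def flip(img, horizontal=True):
    """Mirror the image left-right (horizontal) or top-bottom (vertical)."""
    return [row[::-1] for row in img] if horizontal else [list(row) for row in img[::-1]]

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

@dataclass
class Rollout:
    """One hypothetical model rollout: a single tool call plus a final answer."""
    tool_name: str
    tool_args: dict = field(default_factory=dict)
    answer: str = ""

def tool_reward(rollout, target_name, target_args):
    """Stage 1 (assumed form): score the tool call against a supervised target.

    1.0 if tool and arguments match, 0.5 if only the tool matches, else 0.0.
    """
    if rollout.tool_name != target_name:
        return 0.0
    return 1.0 if rollout.tool_args == target_args else 0.5

def accuracy_reward(rollout, gold_answer):
    """Stage 2 (assumed form): exact-match reward on the final answer only."""
    return 1.0 if rollout.answer.strip().lower() == gold_answer.strip().lower() else 0.0

def curriculum_reward(stage, rollout, *, target_name=None, target_args=None, gold_answer=None):
    """Select the reward by curriculum stage so the two objectives are never mixed."""
    if stage == 1:
        return tool_reward(rollout, target_name, target_args)
    if stage == 2:
        return accuracy_reward(rollout, gold_answer)
    raise ValueError(f"unknown curriculum stage: {stage}")

if __name__ == "__main__":
    r = Rollout("zoom", {"top": 0, "left": 1, "height": 2, "width": 1}, "cat")
    print(curriculum_reward(1, r, target_name="zoom", target_args=r.tool_args))
    print(curriculum_reward(2, r, gold_answer="Cat"))
```

Keeping the stage-1 and stage-2 rewards in separate functions mirrors the idea of training on one objective at a time; in the second stage of this sketch the tool call is still permitted but contributes nothing to the reward.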