Visual Reasoning through Tool-supervised Reinforcement Learning

arXiv cs.CV / 4/23/2026



Key Points

The paper studies how multimodal large language models can learn to use visual tools effectively to solve complex visual reasoning tasks.
It introduces a new Tool-supervised Reinforcement Learning (ToolsRL) framework that provides direct supervision signals for tool use, making tool learning more effective.
The approach uses simple, native, and interpretable visual tools (e.g., zoom, rotate, flip, and drawing point/line) whose supervision data is relatively easy to collect.
A two-stage reinforcement learning curriculum is proposed: first learn tool-calling skills using tool-specific rewards, then train for visual-reasoning accuracy while allowing tool calls, reducing conflicts between different optimization goals.
Experiments indicate that the tool-supervised curriculum improves training efficiency and enables strong tool-use capabilities for complex visual reasoning.
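To make the tool set named above concrete, here is a minimal sketch of zoom, rotate, flip, and draw-point operations. It is not the paper's implementation: the image representation (nested lists of pixel values), the function names, and the `apply_tool` dispatcher are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical image type: a grayscale image as rows of integer pixel values.
Image = List[List[int]]


def zoom_in(img: Image, top: int, left: int, bottom: int, right: int) -> Image:
    """Crop the region rows [top, bottom) and columns [left, right).

    A real zoom tool would typically also upsample the crop; omitted here.
    """
    return [row[left:right] for row in img[top:bottom]]


def rotate_90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def draw_point(img: Image, y: int, x: int, value: int = 255) -> Image:
    """Return a copy of the image with the pixel at (y, x) set to `value`."""
    out = [row[:] for row in img]
    out[y][x] = value
    return out


# Hypothetical registry so a model's tool call can be dispatched by name.
TOOLS: Dict[str, Callable[..., Image]] = {
    "zoom_in": zoom_in,
    "rotate_90": rotate_90,
    "flip_horizontal": flip_horizontal,
    "draw_point": draw_point,
}


def apply_tool(name: str, img: Image, **kwargs: int) -> Image:
    """Look up a tool by name and apply it; unknown names raise ValueError."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](img, **kwargs)
```

Each tool takes an image plus a few integer arguments and returns a new image, which is one plausible reason such tools are easy to supervise: the correct arguments (a crop box, a rotation, a point location) can be checked directly against annotations.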

Abstract

In this paper, we investigate how Multimodal Large Language Models can effectively master tool use to solve complex visual reasoning tasks. To this end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, which uses direct tool supervision for more effective tool-use learning. We focus on a set of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a reinforcement learning curriculum in which the first stage is optimized solely with a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while still allowing tool calls. In this way, the model masters tool calling before it uses tools to complete visual reasoning tasks, which avoids potential optimization conflicts among these heterogeneous tasks. Our experiments show that tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
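The two-stage curriculum described in the abstract can be sketched as a reward function that switches on the training stage. This is only an illustration under stated assumptions: the abstract does not specify the reward definitions, so the IoU-based tool reward for a zoom box, the exact-match accuracy reward, and all names below (`box_iou`, `tool_reward`, `accuracy_reward`, `curriculum_reward`) are hypothetical stand-ins.

```python
from typing import Optional, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tool_reward(pred_box: Box, target_box: Box) -> float:
    """Assumed stage-1 reward: how well a predicted zoom box matches supervision."""
    return box_iou(pred_box, target_box)


def accuracy_reward(pred_answer: str, gold_answer: str) -> float:
    """Assumed stage-2 reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0


def curriculum_reward(
    stage: int,
    *,
    pred_box: Optional[Box] = None,
    target_box: Optional[Box] = None,
    pred_answer: Optional[str] = None,
    gold_answer: Optional[str] = None,
) -> float:
    """Stage 1 scores only tool use; stage 2 scores only the final answer."""
    if stage == 1:
        if pred_box is None or target_box is None:
            raise ValueError("stage 1 needs pred_box and target_box")
        return tool_reward(pred_box, target_box)
    if stage == 2:
        if pred_answer is None or gold_answer is None:
            raise ValueError("stage 2 needs pred_answer and gold_answer")
        return accuracy_reward(pred_answer, gold_answer)
    raise ValueError(f"unknown stage: {stage}")
```

Keeping the two rewards in separate stages, rather than summing them, mirrors the abstract's stated motivation: the policy is never asked to optimize tool-call quality and answer accuracy at the same time.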


