Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

arXiv cs.LG · April 23, 2026

Key Points

  • The paper tackles theoretical gaps in reinforcement fine-tuning with verifiable rewards (RLVR) for large vision-language models (LVLMs), especially Visual-ARFT’s convergence and transfer behavior.
  • It introduces the Tool-Augmented Markov Decision Process (TA-MDP) framework to formally model multimodal agent decision-making with bounded-depth tool calls.
  • The authors prove that Group Relative Policy Optimization (GRPO) with composite verifiable rewards converges to a first-order stationary point at an O(1/√T) rate, with explicit dependence on reward components and group size.
  • They derive a Reward Decomposition Theorem that quantifies when optimizing decomposed reward components is preferable to joint optimization.
  • Finally, the work provides a PAC-Bayes generalization bound that explains why training on small tool-augmented task sets transfers well to out-of-distribution domains.
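The GRPO setup summarized above can be sketched minimally in code. The component weights, function names, and the four-rollout group below are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: a composite verifiable reward (format compliance,
# answer accuracy, tool executability) and GRPO-style group-relative
# advantages. Weights and names are assumptions for illustration.

def composite_reward(format_ok: bool, answer_correct: bool, tool_executed: bool,
                     weights=(0.2, 0.6, 0.2)) -> float:
    """Weighted sum of binary verifiable reward components."""
    components = (float(format_ok), float(answer_correct), float(tool_executed))
    return sum(w * c for w, c in zip(weights, components))

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each rollout's reward is normalized by
    the group's mean and standard deviation (G = group size)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    if std == 0.0:
        # All rollouts scored identically: no relative signal to learn from.
        return [0.0] * g
    return [(r - mean) / std for r in rewards]

# Example: a group of 4 rollouts sampled for the same prompt.
group = [
    composite_reward(True, True, True),    # fully correct
    composite_reward(True, False, True),   # well-formed, wrong answer
    composite_reward(False, False, False), # failed on all components
    composite_reward(True, True, False),   # correct, but tool call failed
]
adv = grpo_advantages(group)
```

Normalizing within the group is what makes the baseline "relative": a rollout is rewarded only for outscoring its siblings, which is why the group size appears explicitly in the paper's convergence bound.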

Abstract

Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i) how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii) why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the *Tool-Augmented Markov Decision Process* (TA-MDP), a formal framework that models multimodal agentic decision-making with bounded-depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first-order stationary point at rate O(1/√T) with explicit dependence on the number of reward components and group size (**Theorem 1**). Second, we derive a *Reward Decomposition Theorem* that bounds the sub-optimality gap between decomposed per-component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (**Theorem 2**). Third, we establish a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (**Theorem 3**).
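Read schematically, the Theorem 1 claim has the standard non-convex stationarity form. The exact constant's dependence on the number of reward components and the group size is what the paper makes explicit; the C(K, G) below is a placeholder for that dependence, not the paper's bound:

```latex
\min_{t \le T} \; \mathbb{E}\left[ \|\nabla J(\theta_t)\|^2 \right]
\;\le\; \frac{C(K, G)}{\sqrt{T}}
```

Here J is the expected composite verifiable reward of policy parameters θ_t, T is the number of GRPO updates, K is the number of reward components, and G is the group size.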