Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

arXiv cs.LG · April 23, 2026

Key Points

  • The paper tackles theoretical gaps in reinforcement fine-tuning with verifiable rewards (RLVR) for large vision-language models (LVLMs), especially Visual-ARFT’s convergence and transfer behavior.
  • It introduces the Tool-Augmented Markov Decision Process (TA-MDP) framework to formally model multimodal agent decision-making with bounded-depth tool calls.
  • The authors prove that Group Relative Policy Optimization (GRPO) with composite verifiable rewards converges to a first-order stationary point at an O(1/√T) rate, with explicit dependence on reward components and group size.
  • They derive a Reward Decomposition Theorem that quantifies when optimizing decomposed reward components is preferable to joint optimization.
  • Finally, the work provides a PAC-Bayes generalization bound that explains why training on small tool-augmented task sets transfers well to out-of-distribution domains.
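The GRPO setup summarized above can be sketched minimally in code. The component weights, function names, and the four-rollout group below are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: a composite verifiable reward (format compliance,
# answer accuracy, tool executability) and GRPO-style group-relative
# advantages. Weights and names are assumptions for illustration.

def composite_reward(format_ok: bool, answer_correct: bool, tool_executed: bool,
                     weights=(0.2, 0.6, 0.2)) -> float:
    """Weighted sum of binary verifiable reward components."""
    components = (float(format_ok), float(answer_correct), float(tool_executed))
    return sum(w * c for w, c in zip(weights, components))

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each rollout's reward is normalized by
    the group's mean and standard deviation (G = group size)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    if std == 0.0:
        # All rollouts scored identically: no relative signal to learn from.
        return [0.0] * g
    return [(r - mean) / std for r in rewards]

# Example: a group of 4 rollouts sampled for the same prompt.
group = [
    composite_reward(True, True, True),    # fully correct
    composite_reward(True, False, True),   # well-formed, wrong answer
    composite_reward(False, False, False), # failed on all components
    composite_reward(True, True, False),   # correct, but tool call failed
]
adv = grpo_advantages(group)
```

Normalizing within the group is what makes the baseline "relative": a rollout is rewarded only for outscoring its siblings, which is why the group size appears explicitly in the paper's convergence bound.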

Abstract

Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i) how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii) why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the *Tool-Augmented Markov Decision Process* (TA-MDP), a formal framework that models multimodal agentic decision-making with bounded-depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first-order stationary point at rate O(1/√T) with explicit dependence on the number of reward components and group size (**Theorem 1**). Second, we derive a *Reward Decomposition Theorem* that bounds the sub-optimality gap between decomposed per-component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (**Theorem 2**). Third, we establish a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (**Theorem 3**).
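Read schematically, the Theorem 1 claim has the standard non-convex stationarity form. The exact constant's dependence on the number of reward components and the group size is what the paper makes explicit; the C(K, G) below is a placeholder for that dependence, not the paper's bound:

```latex
\min_{t \le T} \; \mathbb{E}\left[ \|\nabla J(\theta_t)\|^2 \right]
\;\le\; \frac{C(K, G)}{\sqrt{T}}
```

Here J is the expected composite verifiable reward of policy parameters θ_t, T is the number of GRPO updates, K is the number of reward components, and G is the group size.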