S-GRPO: Unified Post-Training for Large Vision-Language Models

arXiv cs.LG / 4/21/2026


Key Points

  • Current LVLM post-training methods (SFT and RL) each have drawbacks when used alone: SFT can cause catastrophic forgetting via distribution shift, while RL can suffer from cold-start/optimization collapse in sparse-reward visual tasks.
  • The paper introduces S-GRPO, a unified framework that combines imitation-learning guidance with multi-trajectory preference optimization to improve both stability and exploration.
  • S-GRPO uses Conditional Ground-Truth Trajectory Injection (CGI) for direct-generation visual tasks: when a binary verifier finds that every trajectory in a sampled group fails (a complete exploratory failure), CGI injects the verified ground-truth trajectory into the candidate pool.
  • By giving the injected anchor a deterministic maximal reward, S-GRPO ensures a strong positive learning signal in group-relative advantage estimation and reframes supervised learning as a high-advantage component of policy gradients.
  • The authors report theoretical and empirical results showing faster convergence, better domain adaptation, and preserved general multimodal capabilities compared with applying SFT or RL in isolation.
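The injection-plus-advantage mechanism above can be sketched in a few lines. This is an illustrative reconstruction from the paper's description, not the authors' code; the reward convention (0 for a verified failure, a deterministic maximal reward `R_MAX` for the injected anchor) and all names are assumptions.

```python
# Hypothetical sketch of Conditional Ground-Truth Trajectory Injection (CGI)
# combined with GRPO-style group-relative advantage estimation.
from statistics import mean, pstdev

R_MAX = 1.0  # assumed deterministic maximal reward for the injected anchor

def group_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def cgi_step(sampled_rewards, anchor_reward=R_MAX):
    """If the binary verifier scores every sampled trajectory as a failure
    (reward 0), inject the verified ground-truth trajectory with maximal
    reward so group-relative estimation still yields a positive signal."""
    rewards = list(sampled_rewards)
    injected = all(r == 0.0 for r in rewards)  # complete exploratory failure
    if injected:
        rewards.append(anchor_reward)
    return rewards, group_advantages(rewards), injected

# Sparse-reward cold start: all sampled trajectories fail, so without
# injection every advantage would be zero and no gradient would flow.
rewards, advs, injected = cgi_step([0.0, 0.0, 0.0, 0.0])
```

With the anchor injected, the ground-truth trajectory carries the only positive advantage in the group, which is what lets supervised imitation re-enter the update as a policy-gradient term rather than a separate SFT loss.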

Abstract

Current post-training methodologies for adapting Large Vision-Language Models (LVLMs) generally fall into two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Despite their prevalence, both approaches suffer from inefficiencies when applied in isolation. SFT forces the model's generation along a single expert trajectory, often inducing catastrophic forgetting of general multimodal capabilities due to distributional shifts. Conversely, RL explores multiple generated trajectories but frequently encounters optimization collapse: a cold-start problem where an unaligned model fails to spontaneously sample any domain-valid trajectories in sparse-reward visual tasks. In this paper, we propose Supervised Group Relative Policy Optimization (S-GRPO), a unified post-training framework that integrates the guidance of imitation learning into the multi-trajectory exploration of preference optimization. Tailored for direct-generation visual tasks, S-GRPO introduces Conditional Ground-Truth Trajectory Injection (CGI). When a binary verifier detects a complete exploratory failure within a sampled group of trajectories, CGI injects the verified ground-truth trajectory into the candidate pool. By assigning a deterministic maximal reward to this injected anchor, S-GRPO enforces a positive signal within the group-relative advantage estimation. This mechanism reformulates the supervised learning objective as a high-advantage component of the policy gradient, compelling the model to dynamically balance between exploiting the expert trajectory and exploring novel visual concepts. Theoretical analysis and empirical results demonstrate that S-GRPO gracefully bridges the gap between SFT and RL, drastically accelerates convergence, and achieves superior domain adaptation while preserving the base model's general-purpose capabilities.
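The abstract's central claim, that the supervised objective becomes "a high-advantage component of the policy gradient," can be made concrete with a vanilla REINFORCE-style surrogate. The sketch below is a simplified illustration, not the paper's actual loss (which would include GRPO clipping and per-token terms); the trajectory log-probabilities and advantage values are hypothetical numbers.

```python
# Illustrative sketch: in a policy-gradient surrogate over a group of G
# trajectories, the injected ground-truth anchor's term is just a positively
# weighted negative log-likelihood, i.e., supervised imitation learning.

def pg_loss(logps, advs):
    """Vanilla surrogate: L = -(1/G) * sum_i A_i * log pi(tau_i)."""
    return -sum(a * lp for a, lp in zip(advs, logps)) / len(logps)

# Hypothetical per-trajectory log-probabilities under the current policy;
# the last entry is the injected ground-truth trajectory.
logps = [-12.0, -11.5, -13.0, -12.5, -9.0]
# Group-relative advantages after injection: failed samples are negative,
# the anchor's deterministic maximal reward yields the large positive value.
advs = [-0.5, -0.5, -0.5, -0.5, 2.0]

# The anchor's contribution, -A_anchor * log pi(tau_gt) / G, is a scaled
# SFT (negative log-likelihood) loss on the expert trajectory.
sft_component = -advs[-1] * logps[-1] / len(logps)
```

Minimizing the anchor's term pushes probability mass toward the expert trajectory exactly as SFT would, while the negative-advantage terms simultaneously push mass away from the failed explorations, which is the balance between exploitation and exploration the abstract describes.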