S-GRPO: Unified Post-Training for Large Vision-Language Models
arXiv cs.LG / 4/21/2026
Key Points
- Current LVLM post-training methods each have drawbacks when used alone: supervised fine-tuning (SFT) can cause catastrophic forgetting via distribution shift, while reinforcement learning (RL) can suffer from cold-start and optimization-collapse problems on sparse-reward visual tasks.
- The paper introduces S-GRPO, a unified framework that combines imitation-learning guidance with multi-trajectory preference optimization to improve both stability and exploration.
- S-GRPO uses Conditional Ground-Truth Trajectory Injection (CGI) for direct-generation visual tasks: when a verifier finds an exploratory failure across a sampled trajectory group, CGI injects the verified ground-truth trajectory into the candidate pool (see the first sketch after this list).
- By giving the injected anchor a deterministic maximal reward, S-GRPO ensures a strong positive learning signal in group-relative advantage estimation and reframes supervised learning as a high-advantage component of policy gradients (see the second sketch after this list).
- The authors report theoretical and empirical results showing faster convergence, better domain adaptation, and preserved general multimodal capabilities compared with applying SFT or RL in isolation.
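To make the CGI step concrete, here is a minimal Python sketch. The names (`Trajectory`, `cgi_inject`, `verifier`) are hypothetical stand-ins for the paper's actual interfaces, and the binary 0/1 reward is an assumption; the paper's verifier and reward scheme may differ.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    tokens: List[int]          # generated token ids for one rollout
    reward: float = 0.0        # verifier-assigned reward
    is_injected: bool = False  # True only for the injected ground-truth anchor

def cgi_inject(group: List[Trajectory],
               ground_truth: Trajectory,
               verifier: Callable[[Trajectory], bool],
               max_reward: float = 1.0) -> List[Trajectory]:
    """Score a sampled trajectory group; if every rollout fails the verifier
    (an 'exploratory failure'), inject the verified ground-truth trajectory
    into the candidate pool with a deterministic maximal reward."""
    for traj in group:
        traj.reward = max_reward if verifier(traj) else 0.0
    if all(traj.reward == 0.0 for traj in group):  # exploratory failure
        anchor = Trajectory(tokens=ground_truth.tokens,
                            reward=max_reward,      # deterministic maximum
                            is_injected=True)
        group = group + [anchor]
    return group
```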
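The second sketch shows why the injected anchor carries a strong positive signal. `group_relative_advantages` is a hypothetical helper implementing the standard GRPO normalization (reward minus group mean, divided by group standard deviation); the paper's exact estimator may differ in detail.

```python
import math
from typing import List

def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each trajectory's reward by the
    group mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four failed rollouts (reward 0) plus an injected anchor (reward 1).
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
# Anchor advantage ~ +2.0; each failed rollout ~ -0.5.
```

Because the anchor's advantage dominates an otherwise all-failure group, the policy-gradient update on its tokens behaves like a heavily weighted supervised-learning step, which is how S-GRPO folds imitation guidance into the RL objective.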