GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

arXiv cs.AI / 4/17/2026


Key Points

  • The paper analyzes training dynamics to argue that conventional supervised fine-tuning (SFT) can be viewed as a fragile form of policy-gradient optimization, suffering from issues like sparse implicit rewards and unstable inverse-probability weighting.
  • It shows how these problems can cause single-path dependency, entropy collapse, and gradient explosion, limiting the ability to combine efficient knowledge injection with strong generalization.
  • To address this, the authors introduce Group Fine-Tuning (GFT), a unified post-training framework with two key components.
  • Group Advantage Learning forms diverse response groups and uses normalized contrastive supervision to reduce reward sparsity, while Dynamic Coefficient Rectification adaptively bounds inverse-probability weights to stabilize training.
  • Experiments report that GFT consistently outperforms SFT-based approaches and produces policies that integrate more smoothly with subsequent reinforcement learning (RL).
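The two mechanisms above can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm: the function names, the mean/std normalization, and the fixed clipping cap are assumptions chosen to make the idea concrete — group advantages give every response in a group a dense contrastive signal, and rectification caps the 1/π weight that would otherwise explode on low-probability responses.

```python
def group_advantages(rewards, eps=1e-8):
    # Normalize rewards within one response group: (r - mean) / std.
    # Every response gets a signal, alleviating reward sparsity.
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (var ** 0.5 + eps) for r in rewards]

def rectified_coefficient(prob, cap=10.0):
    # Bound the inverse-probability weight 1/pi so that rare
    # responses cannot blow up the gradient (cap is illustrative).
    return min(1.0 / prob, cap)

# Toy group of 4 sampled responses: scalar rewards and model probabilities.
rewards = [1.0, 0.0, 0.0, 1.0]
probs = [0.5, 0.001, 0.2, 0.3]

advs = group_advantages(rewards)          # [1, -1, -1, 1] for this group
coeffs = [rectified_coefficient(p) for p in probs]
# Per-response loss weight = bounded coefficient * group advantage.
weights = [c * a for c, a in zip(coeffs, advs)]
```

Note how the second response, with probability 0.001, would receive a raw weight of 1000 under pure inverse-probability weighting; the cap bounds it at 10 while leaving well-calibrated responses untouched.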

Abstract

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
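The abstract's claim that SFT is a special case of policy gradient with inverse-probability weighting follows from the identity ∇ log π(y|x) = (1/π(y|x)) ∇ π(y|x). The toy check below (a softmax policy over three actions; the specific logits are arbitrary) verifies this numerically and shows why the implicit weight is unstable: the lower π(y|x), the larger the 1/π factor multiplying the gradient.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad_log_prob(logits, k):
    # d/dz_j log softmax(z)_k = 1[j == k] - softmax(z)_j  (SFT gradient)
    p = softmax(logits)
    return [(1.0 if j == k else 0.0) - p[j] for j in range(len(logits))]

def grad_prob(logits, k):
    # d/dz_j softmax(z)_k = softmax(z)_k * (1[j == k] - softmax(z)_j)
    p = softmax(logits)
    return [p[k] * ((1.0 if j == k else 0.0) - p[j]) for j in range(len(logits))]

logits = [2.0, 0.5, -1.0]
k = 2  # a low-probability target, as in imitation of a rare expert response
p = softmax(logits)
g_sft = grad_log_prob(logits, k)
g_pg = [(1.0 / p[k]) * g for g in grad_prob(logits, k)]
# SFT gradient == policy gradient scaled by the implicit weight 1/pi(y|x).
assert all(abs(a - b) < 1e-9 for a, b in zip(g_sft, g_pg))
```

Since the target here has probability p[2] ≈ 0.04, the implicit reward weight 1/π exceeds 20; as π → 0 this factor is unbounded, which is the gradient-explosion failure mode the paper's Dynamic Coefficient Rectification is designed to control.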