Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
arXiv cs.LG / 4/3/2026
Key Points
- The paper argues that standard PPO post-training can be harmed by noisy or unfaithful episodes in the rollout buffer, which weaken the optimization signal and slow training.
- It introduces Influence-Guided PPO (I-PPO), which uses gradient-based influence scoring to remove episodes whose gradient contributions are anti-aligned with a validation gradient (see the sketch after this list).
- The filtering is designed to reduce unfaithful chain-of-thought (CoT) reasoning while improving overall model quality.
- Experiments reported in the study show I-PPO outperforming both SFT and PPO baselines, with the episode filtering acting as an intrinsic early-stopping mechanism that improves training efficiency.
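
The filtering idea in the second bullet can be illustrated with a first-order influence score: the dot product between an episode's PPO loss gradient and the gradient of a held-out validation loss. The following is a minimal sketch, not the paper's implementation; `policy`, `episodes`, `ppo_loss_fn`, and `val_loss_fn` are hypothetical placeholders, and the exact scoring used by I-PPO may differ.

```python
import torch

def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])

def influence_filter(policy, episodes, val_batch, ppo_loss_fn, val_loss_fn):
    """Keep only episodes whose per-episode PPO gradient aligns with the
    gradient of a held-out validation objective (positive dot product)."""
    params = [p for p in policy.parameters() if p.requires_grad]

    # Reference direction: gradient of the validation loss.
    g_val = flat_grad(val_loss_fn(policy, val_batch), params)

    kept = []
    for ep in episodes:
        g_ep = flat_grad(ppo_loss_fn(policy, ep), params)
        # First-order influence score. With both losses minimized, a positive
        # dot product means the episode's descent direction also lowers the
        # validation loss; anti-aligned episodes (score < 0) are dropped.
        if torch.dot(g_ep, g_val).item() > 0.0:
            kept.append(ep)
    return kept
```

Note that materializing flat gradients of a full-size LLM is impractical; influence-based methods typically approximate these dot products with random projections or restrict them to a subset of layers, and the same would presumably be needed here.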

