Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
arXiv cs.LG · March 20, 2026
Key Points
- The paper identifies a squeezing effect in Direct Preference Optimization (DPO), in which the probability of preferred responses declines during training, driven by high-curvature directions in logit space and the negative-gradient updates applied to rejected responses.
- It develops a theoretical framework that models coordinate-wise dynamics in logit space to explain how residuals expand along high-curvature directions, which underlie the squeezing phenomenon.
- The authors demonstrate that Sharpness-Aware Minimization (SAM) can suppress this behavior through curvature regularization, and introduce logits-SAM, a computationally efficient variant that perturbs only the output layer rather than all model weights.
- Experiments on Pythia-2.8B, Mistral-7B, and Gemma-2B-IT show that logits-SAM consistently improves the effectiveness of DPO and integrates with existing DPO variants, with code available on GitHub.
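To make the mechanism in the third point concrete, here is a minimal, hypothetical sketch of the logits-SAM idea: standard SAM takes an ascent step in *weight* space along the normalized gradient before computing the update gradient, whereas logits-SAM applies that perturbation only in logit space (the output layer), avoiding a second full forward-backward pass through the model body. The toy DPO-style loss and all function names below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dpo_style_loss(logits, chosen, rejected, beta=0.1):
    """Toy preference loss: -log sigmoid(beta * (logit_chosen - logit_rejected))."""
    margin = beta * (logits[chosen] - logits[rejected])
    return np.log1p(np.exp(-margin))

def loss_grad(logits, chosen, rejected, beta=0.1):
    """Analytic gradient of the toy loss with respect to the logits."""
    margin = beta * (logits[chosen] - logits[rejected])
    s = 1.0 / (1.0 + np.exp(margin))  # sigmoid(-margin)
    g = np.zeros_like(logits)
    g[chosen] = -beta * s   # push the chosen logit up
    g[rejected] = beta * s  # push the rejected logit down (the negative gradient)
    return g

def logits_sam_grad(logits, chosen, rejected, rho=0.05):
    """One logits-SAM step: ascend rho along the normalized logit-space
    gradient, then return the gradient evaluated at the perturbed logits.
    This sharpness-aware gradient would back-propagate into the parameters."""
    g = loss_grad(logits, chosen, rejected)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case logit perturbation
    return loss_grad(logits + eps, chosen, rejected)
```

A quick check of the behavior: at the perturbed logits the preference margin shrinks, so the sharpness-aware gradient is at least as large as the plain one, penalizing sharp (high-curvature) directions in logit space.

```python
logits = np.array([2.0, 1.5, 0.3])
g_plain = loss_grad(logits, chosen=0, rejected=1)
g_sam = logits_sam_grad(logits, chosen=0, rejected=1)
```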
