Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
arXiv cs.CV · April 29, 2026
📰 News · Models & Research
Key Points
- The paper argues that human visual preferences are multi-dimensional, but common preference datasets collapse them into single binary winner/loser labels, creating substantial label noise.
- It shows theoretically that this compression can produce conflicting gradient signals that mislead Diffusion Direct Preference Optimization (DPO) during training.
- To solve this, the authors propose Semi-DPO, a semi-supervised framework that treats consistent preference pairs as clean labeled data and conflicting pairs as noisy unlabeled data.
- Semi-DPO first trains on a consensus-filtered clean subset, then uses the resulting model to pseudo-label the conflicting pairs, iteratively refining the model on the relabeled data.
- Experiments reportedly achieve state-of-the-art alignment with complex human preferences while avoiding additional human annotations and explicit reward models, and the team plans to release code and models.
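The two-stage recipe above can be sketched in miniature. Everything here is illustrative: the per-dimension `votes`, the `feat_a`/`feat_b` feature vectors, and the Bradley-Terry-style logistic scorer are assumptions standing in for the paper's actual setup, which trains a diffusion model with the DPO objective rather than a toy preference scorer.

```python
import math


def split_by_consensus(pairs):
    """Partition pairs into a clean subset (all preference dimensions agree
    on the winner) and a noisy subset (dimensions conflict)."""
    clean, noisy = [], []
    for pair in pairs:
        votes = pair["votes"]  # per-dimension winner: 0 = sample A, 1 = sample B
        if all(v == votes[0] for v in votes):
            clean.append(dict(pair, label=votes[0]))
        else:
            noisy.append(pair)
    return clean, noisy


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_scorer(pairs, dim, lr=0.1, epochs=200):
    """Fit a logistic preference scorer on feature differences of the clean
    pairs (a stand-in for the paper's DPO-trained diffusion model)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for pair in pairs:
            diff = [a - b for a, b in zip(pair["feat_a"], pair["feat_b"])]
            target = 1.0 if pair["label"] == 0 else 0.0  # P(A preferred)
            pred = _sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            w = [wi - lr * (pred - target) * di for wi, di in zip(w, diff)]
    return w


def pseudo_label(w, noisy_pairs, threshold=0.8):
    """Pseudo-label conflicting pairs the clean-trained scorer is confident
    about; low-confidence pairs are held out of the next training round."""
    kept = []
    for pair in noisy_pairs:
        diff = [a - b for a, b in zip(pair["feat_a"], pair["feat_b"])]
        prob_a = _sigmoid(sum(wi * di for wi, di in zip(w, diff)))
        if prob_a >= threshold:
            kept.append(dict(pair, label=0))
        elif prob_a <= 1.0 - threshold:
            kept.append(dict(pair, label=1))
    return kept
```

The confidence threshold mirrors the semi-supervised framing: only conflicting pairs the clean-trained model relabels decisively re-enter training, so annotation noise is recycled rather than discarded outright.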