Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

arXiv cs.CV · April 29, 2026

📰 News · Models & Research

Key Points

  • The paper argues that human visual preferences are multi-dimensional, but common preference datasets collapse them into a single binary winner/loser label per image pair, creating substantial label noise.
  • It shows theoretically that this compression can produce conflicting gradient signals that mislead Diffusion Direct Preference Optimization (DPO) during training.
  • To solve this, the authors propose Semi-DPO, a semi-supervised framework that treats consistent preference pairs as clean labeled data and conflicting pairs as noisy unlabeled data.
  • Semi-DPO first trains on a consensus-filtered clean subset, then uses the resulting model as an implicit classifier to pseudo-label the conflicting pairs and refine iteratively (see the sketch after this list).
  • Experiments reportedly achieve state-of-the-art alignment with complex human preferences while avoiding additional human annotations and explicit reward models, and the team plans to release code and models.
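
The two-stage recipe can be sketched end to end. In the toy below, plain logistic regression stands in for Diffusion-DPO, and three simulated preference dimensions stand in for multi-dimensional annotations; all names, noise levels, round counts, and the confidence threshold are assumptions made for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: each pair has a feature-difference vector, and a hidden
# linear model defines the "true" preference margin. (All assumed.)
n_pairs, n_feat, n_dims = 2000, 8, 3
X = rng.normal(size=(n_pairs, n_feat))
w_true = rng.normal(size=n_feat)
margin = X @ w_true

# Simulate 3 preference dimensions as independently noisy binary votes:
# +1 means the first image wins on that dimension, -1 the second.
votes = np.sign(margin[:, None] + rng.normal(scale=2.0, size=(n_pairs, n_dims)))

# Consensus filtering: unanimous pairs are "clean", the rest are "noisy".
consensus = np.abs(votes.sum(axis=1)) == n_dims
y_clean = (votes[consensus, 0] > 0).astype(float)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Logistic regression by gradient descent (stand-in for DPO training)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Stage 1: train only on the consensus-filtered clean subset.
w = fit_logreg(X[consensus], y_clean)

# Stage 2: use the trained model as an implicit classifier to pseudo-label
# the conflicting pairs, keep confident ones, and refine. Two rounds shown.
for _ in range(2):
    p_noisy = 1.0 / (1.0 + np.exp(-X[~consensus] @ w))
    confident = np.abs(p_noisy - 0.5) > 0.4  # confidence gate (assumed)
    X_aug = np.vstack([X[consensus], X[~consensus][confident]])
    y_aug = np.concatenate([y_clean, (p_noisy[confident] > 0.5).astype(float)])
    w = fit_logreg(X_aug, y_aug)

acc = (np.sign(X @ w) == np.sign(margin)).mean()
print(f"agreement with underlying preference: {acc:.3f}")
```

The structure mirrors the paper's pipeline: a clean-subset warm start, then pseudo-labeling of the conflicting pairs by the model itself, with no extra human annotation and no explicit reward model in the loop.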

Abstract

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: https://github.com/L-CodingSpace/semi-dpo
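
For intuition on the conflicting-gradient claim, consider the generic DPO objective (the paper analyzes the diffusion-specific variant; this standard form is shown only for illustration):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x^w,\,x^l)}\!\left[\log \sigma(m_\theta)\right],
\qquad
m_\theta = \beta \log \frac{\pi_\theta(x^w)}{\pi_{\mathrm{ref}}(x^w)}
         - \beta \log \frac{\pi_\theta(x^l)}{\pi_{\mathrm{ref}}(x^l)}
```

The per-pair gradient is -σ(-m_θ) ∇_θ m_θ, which pushes the margin m_θ up. If the holistic binary label is effectively wrong for a pair (the roles of x^w and x^l should be swapped on some dimension), that pair's term becomes -log σ(-m_θ), whose gradient σ(m_θ) ∇_θ m_θ pushes the same margin down. A batch mixing both kinds of pairs therefore produces partially cancelling or misdirected updates, which is the failure mode the consensus filtering is meant to avoid.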