Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

arXiv cs.CV · April 29, 2026

📰 News · Models & Research

Key Points

  • The paper argues that human visual preferences are multi-dimensional, but common preference datasets collapse them into a single binary winner/loser label per image pair, creating substantial label noise.
  • It shows theoretically that this compression can produce conflicting gradient signals that mislead Diffusion Direct Preference Optimization (DPO) during training.
  • To solve this, the authors propose Semi-DPO, a semi-supervised framework that treats consistent preference pairs as clean labeled data and conflicting pairs as noisy unlabeled data.
  • Semi-DPO first trains on a consensus-filtered clean subset, then uses the resulting model as an implicit classifier to pseudo-label the conflicting pairs and refine iteratively (see the sketch after this list).
  • Experiments reportedly achieve state-of-the-art alignment with complex human preferences while avoiding additional human annotations and explicit reward models, and the team plans to release code and models.
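
The two-stage recipe can be sketched end to end. In the toy below, plain logistic regression stands in for Diffusion-DPO, and three simulated preference dimensions stand in for multi-dimensional annotations; all names, noise levels, round counts, and the confidence threshold are assumptions made for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: each pair has a feature-difference vector, and a hidden
# linear model defines the "true" preference margin. (All assumed.)
n_pairs, n_feat, n_dims = 2000, 8, 3
X = rng.normal(size=(n_pairs, n_feat))
w_true = rng.normal(size=n_feat)
margin = X @ w_true

# Simulate 3 preference dimensions as independently noisy binary votes:
# +1 means the first image wins on that dimension, -1 the second.
votes = np.sign(margin[:, None] + rng.normal(scale=2.0, size=(n_pairs, n_dims)))

# Consensus filtering: unanimous pairs are "clean", the rest are "noisy".
consensus = np.abs(votes.sum(axis=1)) == n_dims
y_clean = (votes[consensus, 0] > 0).astype(float)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Logistic regression by gradient descent (stand-in for DPO training)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Stage 1: train only on the consensus-filtered clean subset.
w = fit_logreg(X[consensus], y_clean)

# Stage 2: use the trained model as an implicit classifier to pseudo-label
# the conflicting pairs, keep confident ones, and refine. Two rounds shown.
for _ in range(2):
    p_noisy = 1.0 / (1.0 + np.exp(-X[~consensus] @ w))
    confident = np.abs(p_noisy - 0.5) > 0.4  # confidence gate (assumed)
    X_aug = np.vstack([X[consensus], X[~consensus][confident]])
    y_aug = np.concatenate([y_clean, (p_noisy[confident] > 0.5).astype(float)])
    w = fit_logreg(X_aug, y_aug)

acc = (np.sign(X @ w) == np.sign(margin)).mean()
print(f"agreement with underlying preference: {acc:.3f}")
```

The structure mirrors the paper's pipeline: a clean-subset warm start, then pseudo-labeling of the conflicting pairs by the model itself, with no extra human annotation and no explicit reward model in the loop.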

Abstract

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: https://github.com/L-CodingSpace/semi-dpo
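
For intuition on the conflicting-gradient claim, consider the generic DPO objective (the paper analyzes the diffusion-specific variant; this standard form is shown only for illustration):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x^w,\,x^l)}\!\left[\log \sigma(m_\theta)\right],
\qquad
m_\theta = \beta \log \frac{\pi_\theta(x^w)}{\pi_{\mathrm{ref}}(x^w)}
         - \beta \log \frac{\pi_\theta(x^l)}{\pi_{\mathrm{ref}}(x^l)}
```

The per-pair gradient is -σ(-m_θ) ∇_θ m_θ, which pushes the margin m_θ up. If the holistic binary label is effectively wrong for a pair (the roles of x^w and x^l should be swapped on some dimension), that pair's term becomes -log σ(-m_θ), whose gradient σ(m_θ) ∇_θ m_θ pushes the same margin down. A batch mixing both kinds of pairs therefore produces partially cancelling or misdirected updates, which is the failure mode the consensus filtering is meant to avoid.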