TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
arXiv cs.AI / 5/4/2026
Key Points
- The paper argues that standard DPO can be brittle because it treats human preference signals as flat winner-vs-loser labels and is sensitive to noisy or fragile “chains of thought.”
- It proposes TUR-DPO, which extends DPO by using lightweight reasoning “topologies” and combining semantic faithfulness, usefulness, and topology quality into a calibrated uncertainty signal.
- TUR-DPO introduces a small learnable reward factorized over these components and plugs it into an uncertainty-weighted, RL-free DPO objective that uses only a fixed or moving reference policy (a minimal sketch of such an objective follows this list).
- Experiments on multiple open 7–8B models across reasoning, factual QA, summarization, and helpful/harmless dialogue show higher judge win-rates, improved faithfulness, and better calibration than DPO.
- The authors report that TUR-DPO also yields consistent gains in multimodal and long-context settings, and that it can match or outperform PPO on reasoning-focused tasks while keeping training simpler and avoiding online rollouts.
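To make the mechanism concrete, here is a minimal PyTorch sketch of what an uncertainty-weighted, RL-free DPO objective with a small factorized reward could look like. It is an illustration based only on the key points above, not the paper's actual formulation: the names `FactorizedReward` and `tur_dpo_loss`, the sigmoid confidence weight, and the calibration term are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedReward(nn.Module):
    """Tiny learnable head that mixes per-response component scores
    (semantic faithfulness, usefulness, topology quality) into one scalar."""

    def __init__(self, n_components: int = 3):
        super().__init__()
        # Learnable mixture weights over the component scores.
        self.mix = nn.Linear(n_components, 1, bias=False)

    def forward(self, components: torch.Tensor) -> torch.Tensor:
        # components: (batch, n_components) scores, e.g. in [0, 1]
        return self.mix(components).squeeze(-1)  # (batch,)


def tur_dpo_loss(policy_logps_w, policy_logps_l,
                 ref_logps_w, ref_logps_l,
                 comps_w, comps_l,
                 reward_head, beta=0.1):
    """Uncertainty-weighted, RL-free preference loss (illustrative sketch).

    *_logps_w / *_logps_l: summed token log-probs of the chosen (w) and
    rejected (l) responses under the trained policy and a frozen reference.
    comps_w / comps_l: (batch, 3) component scores for each response.
    """
    # Standard DPO logit: implicit reward margin between winner and loser.
    logits = beta * ((policy_logps_w - ref_logps_w)
                     - (policy_logps_l - ref_logps_l))

    # Factorized reward margin between the two responses.
    margin = reward_head(comps_w) - reward_head(comps_l)

    # Calibration term (assumed): train the small reward to agree with the
    # preference label (winner > loser) so sigmoid(margin) stays a usable
    # confidence estimate.
    calib = F.binary_cross_entropy_with_logits(margin, torch.ones_like(margin))

    # Detached confidence weight: ~0.5 for ambiguous pairs, near 1 for
    # clear-cut ones. Detaching stops the policy loss from simply driving
    # the confidence toward zero.
    confidence = torch.sigmoid(margin).detach()

    # Down-weight noisy or ambiguous pairs instead of treating every label
    # as a flat winner-vs-loser signal.
    return (confidence * (-F.logsigmoid(logits))).mean() + calib
```

Detaching the confidence weight is one common way to keep an auxiliary reward from being optimized to suppress the main loss; the paper's actual calibration and weighting scheme may differ from this sketch.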