DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO

arXiv stat.ML · April 14, 2026


Key Points

  • The paper compares DPO with DDO-RM for LLM preference optimization, centering on an algorithmic framing of DDO-RM and a minimal held-out benchmark.
  • DDO-RM reframes each prompt as a finite decision problem by updating a policy distribution over multiple candidate responses using reward-model scores, then distilling a reward-guided target distribution back into the policy.
  • Experiments on EleutherAI/pythia-410m with HuggingFaceH4/ultrafeedback_binarized evaluate on the held-out test_prefs split using three random seeds (42, 13, 3407).
  • In this preliminary setup, the paper reports improvements for DDO-RM over DPO, including higher mean pair accuracy (0.5238→0.5602) and AUC (0.5315→0.5382), alongside a reported increase in mean margin (0.1377→0.5353).
  • The authors emphasize the results are early and limited to one model family, one dataset, one held-out split, and a small number of seeds, so broader validation is needed.
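The decision-distribution update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the softmax policy over candidates, the exponential tilting by centered rewards, and the cross-entropy distillation loss are all assumptions that go beyond what the summary states.

```python
import numpy as np

def ddo_rm_target(policy_logits, rewards, beta=1.0):
    """Sketch of a reward-guided target distribution over K candidate
    responses for one prompt (hypothetical details throughout).

    1. Form the policy distribution over candidates via softmax.
    2. Center the reward-model scores under that distribution.
    3. Tilt the policy by the centered rewards and renormalize, giving a
       target distribution to distill back into the policy.
    """
    logits = np.asarray(policy_logits, dtype=float)
    r = np.asarray(rewards, dtype=float)
    # Policy distribution over the K candidates (stable softmax).
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    # Center rewards under the policy distribution.
    baseline = float(pi @ r)
    centered = r - baseline
    # Exponentially tilt the policy toward higher-reward candidates.
    tilted = pi * np.exp(beta * centered)
    target = tilted / tilted.sum()
    return pi, target

def distill_loss(policy_logits, target):
    """Cross-entropy of the target under the policy; minimized when the
    policy matches the reward-guided target distribution."""
    logits = np.asarray(policy_logits, dtype=float)
    log_pi = logits - logits.max()
    log_pi -= np.log(np.exp(log_pi).sum())
    return float(-(np.asarray(target, dtype=float) * log_pi).sum())
```

With a uniform policy and rewards `[1, 0, -1]`, the target shifts probability mass onto the highest-reward candidate, which is the intended "distillation" direction.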

Abstract

This paper centers on the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and a preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback_binarized, evaluate on the held-out test_prefs split, and report results for seeds 42, 13, and 3407. Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds.
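The three reported numbers (pair accuracy, AUC, mean margin) are standard held-out preference metrics. A minimal sketch of how they could be computed from per-pair scores follows; the exact scoring function and tie handling here are assumptions, not taken from the paper.

```python
import numpy as np

def preference_metrics(chosen_scores, rejected_scores):
    """Hypothetical held-out metrics over paired chosen/rejected scores:

    - pair accuracy: fraction of pairs where the chosen response
      outscores the rejected one (ties counted as half);
    - mean margin: average of (chosen score - rejected score);
    - AUC: Mann-Whitney statistic over pooled chosen vs. rejected
      scores, i.e. the probability a random chosen score exceeds a
      random rejected score.
    """
    c = np.asarray(chosen_scores, dtype=float)
    r = np.asarray(rejected_scores, dtype=float)
    # Paired comparison, per prompt.
    acc = float(np.mean((c > r) + 0.5 * (c == r)))
    margin = float(np.mean(c - r))
    # Unpaired (pooled) comparison across all chosen/rejected scores.
    wins = (c[:, None] > r[None, :]).sum() + 0.5 * (c[:, None] == r[None, :]).sum()
    auc = float(wins / (len(c) * len(r)))
    return acc, margin, auc
```

For example, `preference_metrics([1, 2, 3], [0, 1, 2])` gives pair accuracy 1.0 and mean margin 1.0, while the pooled AUC is lower (7/9) because some rejected scores exceed some chosen scores from other pairs.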