DPO (Rafailov et al., NeurIPS 2023) is supposed to be the clean alternative to PPO. No reward model in the training loop, no value function, no rollout collection. Just a binary cross-entropy loss over preference pairs. And the math is elegant: the partition function Z(x) cancels out when you substitute the log-ratio reparameterisation into the Bradley-Terry model.
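For reference, here is the cancellation in question, sketched in the DPO paper's notation. The KL-regularised RL objective has a closed-form optimal policy; inverting it for the reward and plugging into Bradley-Terry makes the β log Z(x) terms subtract away:

```latex
% Closed-form optimum of the KL-regularised reward objective:
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)

% Invert for the reward:
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substitute into Bradley--Terry, p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big).
% The \beta \log Z(x) terms cancel, leaving the DPO loss:
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(
  \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
```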
I implemented it from scratch as part of a multi-stage RLHF project (same model, same tokenizer, same evaluation suite as my PPO and GRPO implementations). Here's what actually happened.
The get_logps function
This is where silent failures live. The shift has to be exact:
```python
shift_logits = logits[:, :-1, :]    # predict positions 1..T
shift_labels = input_ids[:, 1:]     # actual tokens 1..T
shift_mask = response_mask[:, 1:]   # only response positions
```

The mask shifts by one to align with the shifted labels. Get this wrong and the loss looks normal while the model is supervising prompt tokens instead of response tokens. No obvious error signal.
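A minimal self-contained sketch of what the full function could look like, assuming PyTorch tensors (this is my reconstruction, not the post's exact code — the signature and names are illustrative):

```python
import torch
import torch.nn.functional as F

def get_logps(logits, input_ids, response_mask):
    """Sum of per-token log-probs over response positions only.

    logits:        (B, T, V) model outputs
    input_ids:     (B, T)    prompt + response token ids
    response_mask: (B, T)    1.0 on response tokens, 0.0 on prompt/pad
    """
    # Position t's logits predict token t+1, so everything shifts by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = response_mask[:, 1:]   # the mask must shift too

    logps = F.log_softmax(shift_logits, dim=-1)
    # Pick out the log-prob of the token that actually occurred.
    token_logps = torch.gather(
        logps, 2, shift_labels.unsqueeze(-1)
    ).squeeze(-1)
    # Zero out prompt positions; sum over the response.
    return (token_logps * shift_mask).sum(dim=-1)   # (B,)
```

A quick way to catch the mask-alignment bug: feed in uniform logits and check that the output equals (number of response tokens) × −log V, which only holds when the mask is shifted correctly.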
What reward hacking looks like in a loss curve
By step 30, loss = 0.0 and accuracy = 1.0. This looks like fast convergence. It isn't.
The reward margin tells the real story:
| Step | Margin |
|---|---|
| 30 | 56.9 |
| 70 | 240.7 |
| 150 | 599.2 |
A healthy margin is 1–10. At 599 the policy has drifted so far from the reference that it assigns near-zero probability to the rejected response for every pair. The model memorised the preference signal rather than learning a generalizable preference.
Root cause: batch size of 1 with no averaging. Each update can completely overfit one (chosen, rejected) pair before moving to the next.
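The fix is to average the per-pair losses over a real batch so no single pair dominates an update. A hedged sketch of a batched DPO loss (function and argument names are mine, assuming the per-sequence log-probs have already been computed for policy and frozen reference):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Batched DPO loss over B preference pairs; each input is shape (B,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_ratio - rejected_ratio)
    # Mean over the batch: one pair can no longer monopolise a gradient step.
    loss = -F.logsigmoid(margin).mean()
    return loss, margin.detach().mean()
```

With B = 1 this reduces to exactly the per-pair update the post describes, which is why the margin was free to explode to 599.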
What the step 20 behaviour tells you
At step 20: loss = 0.693, accuracy = 0.0, margin = 0.0.
0.693 = log(2) = -log(σ(0)). This is the degenerate case the theory predicts: when the policy exactly mirrors the reference, all log-ratios are zero, the DPO margin is zero, and the loss equals log 2. The model is assigning equal probability to chosen and rejected. Seeing this in a real training run is a nice confirmation that the implementation is correct.
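The arithmetic behind that number, worked out directly:

```python
import math

# When the policy exactly mirrors the reference, every log-ratio is zero,
# so the per-pair DPO loss is -log(sigma(0)) regardless of beta or the data.
sigma_zero = 1 / (1 + math.exp(-0.0))      # sigma(0) = 0.5
loss_at_init = -math.log(sigma_zero)       # = log(2) ~= 0.693
```

So 0.693 at an early step is not a bug signature; it is the expected starting point before the policy has drifted at all.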
The verdict
The architecture is sound. The loss, the frozen reference model, the get_logps masking, the RM-free training loop: all correct. What broke was the training configuration, not the algorithm. After these Phase 1 results (avg reward: 2.40), I tuned β from 0.1 to 0.3, added proper batching, and compared the result head-to-head against PPO and GRPO on the same 16 prompts.
The full comparison is in a separate write-up. The ranking completely reversed after tuning. DPO went from 3rd to 1st.
Full DPO implementation post: brayanbrayan.github.io/machine-learning/rlhf/2026/03/24/dpo-implementation-blog.html
Full comparison study: brayanbrayan.github.io/2026/04/02/rlhf-post-blog.html
Happy to answer questions on any of the implementation details.