DPO (Rafailov et al., NeurIPS 2023) is supposed to be the clean alternative to PPO. No reward model in the training loop, no value function, no rollout collection. Just a binary cross-entropy loss over preference pairs. And the math is elegant: the partition function Z(x) cancels out when you substitute the log-ratio reparameterisation into the Bradley-Terry model.
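For reference, here is the cancellation in question, sketched in the DPO paper's notation. The KL-regularised RL objective has a closed-form optimal policy; inverting it for the reward and plugging into Bradley-Terry makes the β log Z(x) terms subtract away:

```latex
% Closed-form optimum of the KL-regularised reward objective:
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)

% Invert for the reward:
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substitute into Bradley--Terry, p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big).
% The \beta \log Z(x) terms cancel, leaving the DPO loss:
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(
  \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
```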
I implemented it from scratch as part of a multi-stage RLHF project (same model, same tokenizer, same evaluation suite as my PPO and GRPO implementations). Here's what actually happened.
The get_logps function
This is where silent failures live. The shift has to be exact:
```python
shift_logits = logits[:, :-1, :]    # predict positions 1..T
shift_labels = input_ids[:, 1:]     # actual tokens 1..T
shift_mask = response_mask[:, 1:]   # only response positions
```

The mask shifts by one to align with the shifted labels. Get this wrong and the loss looks normal while the model is supervising prompt tokens instead of response tokens. No obvious error signal.
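A minimal self-contained sketch of what the full function could look like, assuming PyTorch tensors (this is my reconstruction, not the post's exact code — the signature and names are illustrative):

```python
import torch
import torch.nn.functional as F

def get_logps(logits, input_ids, response_mask):
    """Sum of per-token log-probs over response positions only.

    logits:        (B, T, V) model outputs
    input_ids:     (B, T)    prompt + response token ids
    response_mask: (B, T)    1.0 on response tokens, 0.0 on prompt/pad
    """
    # Position t's logits predict token t+1, so everything shifts by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = response_mask[:, 1:]   # the mask must shift too

    logps = F.log_softmax(shift_logits, dim=-1)
    # Pick out the log-prob of the token that actually occurred.
    token_logps = torch.gather(
        logps, 2, shift_labels.unsqueeze(-1)
    ).squeeze(-1)
    # Zero out prompt positions; sum over the response.
    return (token_logps * shift_mask).sum(dim=-1)   # (B,)
```

A quick way to catch the mask-alignment bug: feed in uniform logits and check that the output equals (number of response tokens) × −log V, which only holds when the mask is shifted correctly.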
What reward hacking looks like in a loss curve
By step 30, loss = 0.0 and accuracy = 1.0. This looks like fast convergence. It isn't.
The reward margin tells the real story:
| Step | Margin |
|---|---|
| 30 | 56.9 |
| 70 | 240.7 |
| 150 | 599.2 |
A healthy margin is 1–10. At 599 the policy has drifted so far from the reference that it assigns near-zero probability to the rejected response for every pair. The model memorised the preference signal rather than learning a generalizable preference.
Root cause: batch size of 1 with no averaging. Each update can completely overfit one (chosen, rejected) pair before moving to the next.
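The fix is to average the per-pair losses over a real batch so no single pair dominates an update. A hedged sketch of a batched DPO loss (function and argument names are mine, assuming the per-sequence log-probs have already been computed for policy and frozen reference):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Batched DPO loss over B preference pairs; each input is shape (B,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_ratio - rejected_ratio)
    # Mean over the batch: one pair can no longer monopolise a gradient step.
    loss = -F.logsigmoid(margin).mean()
    return loss, margin.detach().mean()
```

With B = 1 this reduces to exactly the per-pair update the post describes, which is why the margin was free to explode to 599.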
What the step 20 behaviour tells you
At step 20: loss = 0.693, accuracy = 0.0, margin = 0.0.
0.693 = log(2) = -log(σ(0)). This is the degenerate case the theory predicts: when the policy exactly mirrors the reference, all log-ratios are zero, the DPO margin is zero, and the loss equals log 2. The model is assigning equal probability to chosen and rejected. Seeing this in a real training run is a nice confirmation that the implementation is correct.
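The arithmetic behind that number, worked out directly:

```python
import math

# When the policy exactly mirrors the reference, every log-ratio is zero,
# so the per-pair DPO loss is -log(sigma(0)) regardless of beta or the data.
sigma_zero = 1 / (1 + math.exp(-0.0))      # sigma(0) = 0.5
loss_at_init = -math.log(sigma_zero)       # = log(2) ~= 0.693
```

So 0.693 at an early step is not a bug signature; it is the expected starting point before the policy has drifted at all.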
The verdict
The architecture is sound. The loss, the frozen reference model, the get_logps masking, the RM-free training loop: all correct. What broke was the training configuration, not the algorithm. After these Phase 1 results (avg reward: 2.40), I tuned β from 0.1 to 0.3, added proper batching, and compared the result head-to-head against PPO and GRPO on the same 16 prompts.
The full comparison is in a separate write-up. The ranking completely reversed after tuning. DPO went from 3rd to 1st.
Full DPO implementation post: brayanbrayan.github.io/machine-learning/rlhf/2026/03/24/dpo-implementation-blog.html
Full comparison study: brayanbrayan.github.io/2026/04/02/rlhf-post-blog.html
Happy to answer questions on any of the implementation details.