Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
arXiv cs.LG / 4/6/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper studies Active Preference Learning (APL) for online Direct Preference Optimization (DPO) with modern LLMs, asking whether uncertainty-based sampling beats simple Random selection when pretrained priors are strong (a minimal sketch of the two selection rules follows this list).
- Across multiple evaluation dimensions (harmlessness, helpfulness, and instruction-following), judged with reward models and LLM-as-a-judge proxies, APL delivers negligible improvements in proxy win rates over Random sampling.
- The authors observe a dissociation: proxy win rate can improve while general capability on standard benchmarks degrades, pointing to a tradeoff, or a misalignment between proxy judgments and broader quality.
- APL neither substantially reduces variance nor prevents “capability collapse” any better than Random sampling, despite the added computational overhead of active selection.
- The study concludes that, under strong pretrained priors, the extra cost of active selection is hard to justify over Random’s “cheap diversity”; the authors release their code publicly.
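
To make the comparison concrete, here is a minimal sketch of the two selection rules under comparison: uniform Random selection versus uncertainty-based selection that scores candidate response pairs by the DPO implicit reward margin. This is not the paper's released code; `generate_pair`, `logp_policy`, and `logp_ref` are hypothetical placeholders for the surrounding online-DPO loop.

```python
import math
import random

# Hypothetical helpers assumed to be provided by the online-DPO loop:
#   generate_pair(prompt)      -> (resp_a, resp_b) sampled from the current policy
#   logp_policy(prompt, resp)  -> summed token log-probability under the policy
#   logp_ref(prompt, resp)     -> summed token log-probability under the frozen reference
BETA = 0.1  # DPO temperature on the implicit reward


def implicit_reward_margin(prompt, resp_a, resp_b):
    """DPO implicit reward difference r(a) - r(b); the shared log-partition term cancels."""
    r_a = BETA * (logp_policy(prompt, resp_a) - logp_ref(prompt, resp_a))
    r_b = BETA * (logp_policy(prompt, resp_b) - logp_ref(prompt, resp_b))
    return r_a - r_b


def select_random(prompts, k):
    """Baseline: k prompts chosen uniformly at random ("cheap diversity")."""
    return random.sample(prompts, k)


def select_uncertain(prompts, k):
    """APL-style selection: label the pairs the policy is least sure about,
    i.e. those whose Bradley-Terry preference probability is closest to 0.5."""
    scored = []
    for p in prompts:
        a, b = generate_pair(p)
        margin = implicit_reward_margin(p, a, b)
        pref = 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the reward margin
        scored.append((abs(pref - 0.5), p, a, b))  # small gap = high uncertainty
    scored.sort(key=lambda t: t[0])
    return [(p, a, b) for _, p, a, b in scored[:k]]
```

The sketch also makes the cost asymmetry visible: the uncertainty rule requires a generation plus policy and reference forward passes for every candidate prompt before anything is labeled, whereas Random selection pays none of that overhead.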
