Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models
arXiv cs.LG / 3/19/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper systematically studies whether DPO can align both understanding and generation in unified multimodal models (Janus-Pro at 1B and 7B) across seven training strategies and two post-hoc methods, finding that generation quality resists DPO alignment under all tested conditions.
- Generation CLIPScore does not improve at 7B, and at 1B every method degrades generation, regardless of preference-data type (real-vs-generated or model-vs-model) or data volume (150-288 pairs).
- Gradient analysis shows that understanding and generation gradients are near-orthogonal, with a large magnitude imbalance driven by VQ token counts (~576 generation tokens vs. ~30-100 text tokens), which makes multi-task DPO difficult.
- The discrete VQ tokenization is identified as a likely structural bottleneck, with the generation DPO loss converging to ln(2); the paper provides practical guidance for practitioners working with VQ-based unified models.
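The ln(2) plateau has a simple interpretation: the per-pair DPO loss is -log sigmoid of the scaled reward margin, so when the policy never separates chosen from rejected (margin stays zero relative to the reference), the loss sits exactly at ln 2 ≈ 0.693. A minimal sketch, using the standard DPO loss form (beta value and token counts below are illustrative, taken from the summary's figures):

```python
import math

def dpo_pair_loss(beta: float, policy_margin: float, ref_margin: float) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)),
    where a margin is log p(chosen) - log p(rejected)."""
    z = beta * (policy_margin - ref_margin)
    return -math.log(1.0 / (1.0 + math.exp(-z)))

# If the policy's preference margin never moves away from the reference
# model's, the loss is pinned at ln(2) -- the plateau the paper reports
# for the generation branch.
stuck = dpo_pair_loss(beta=0.1, policy_margin=0.0, ref_margin=0.0)
print(f"loss = {stuck:.4f}, ln(2) = {math.log(2):.4f}")

# Token-count imbalance: ~576 VQ image tokens per sample vs ~30-100 text
# tokens means the generation loss sums roughly 6x-19x more per-token terms,
# skewing gradient magnitudes before any task weighting is applied.
print(576 / 100, 576 / 30)
```

A flat loss at ln 2 therefore signals that the sequence-level log-likelihood ratios over the discrete VQ tokens carry no usable preference signal, consistent with the paper's tokenization-bottleneck reading.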