Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models
arXiv cs.LG / 3/19/2026
Key Points
- The paper systematically studies whether DPO can align both understanding and generation in unified multimodal models (Janus-Pro at 1B and 7B scales) across seven training strategies and two post-hoc methods, and finds that generation quality resists DPO alignment under all tested conditions.
- Generation CLIPScore does not improve at 7B, and at 1B every method degrades generation, regardless of preference-data type (real-vs-generated or model-vs-model pairs) or data volume (150-288 pairs).
- Gradient analysis shows understanding and generation gradients are near-orthogonal, with a large magnitude imbalance driven by token counts (~576 VQ generation tokens vs. ~30-100 text tokens), making multi-task DPO difficult.
- The discrete VQ tokenization is identified as a likely structural bottleneck: the generation DPO loss converges to ln(2), the value it takes when the policy cannot separate chosen from rejected responses. The paper closes with practical guidance for practitioners working with VQ-based unified models.
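The ln(2) plateau has a simple arithmetic interpretation: the DPO loss is -log σ(β·Δ), where Δ is the chosen-minus-rejected margin of policy-vs-reference log-prob differences, so an indifferent policy (Δ = 0) yields -log σ(0) = -log(1/2) = ln 2 ≈ 0.693. A minimal sketch, with illustrative function names (not the paper's code), assuming the standard DPO objective:

```python
import math

def dpo_loss(margin_chosen: float, margin_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected)).

    Each margin is that response's policy-vs-reference log-prob difference.
    """
    logits = beta * (margin_chosen - margin_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Indifferent policy: equal margins -> loss sits at the ln(2) plateau.
print(dpo_loss(0.0, 0.0))    # ≈ 0.693 (= ln 2)

# Once the policy separates chosen from rejected, the loss drops below ln 2.
print(dpo_loss(2.0, -1.0))
```

A flat loss at ln(2), as reported for the generation branch, thus signals that training never moves the chosen/rejected margin at all, consistent with the paper's reading of discrete VQ tokenization as a structural bottleneck rather than a tuning issue.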