AI Navigate

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

arXiv cs.LG / 3/19/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper systematically studies whether DPO can align both understanding and generation in unified multimodal models (Janus-Pro at 1B and 7B) across seven training strategies and two post-hoc methods, and finds generation quality resists DPO alignment under all tested conditions.
  • Generation CLIPScore does not improve at 7B, and at 1B all methods degrade generation, regardless of preference-data type (real-vs-generated, model-vs-model) or the data volumes tested (150-288 pairs).
  • Gradient analysis shows understanding and generation gradients are near-orthogonal, with an ~11-14x magnitude imbalance driven by VQ token-count asymmetry (576 generation tokens vs. ~30-100 text tokens), making multi-task DPO difficult.
  • The discrete VQ tokenization is identified as a likely structural bottleneck, with the generation DPO loss converging to ln(2); the paper provides practical guidance for practitioners working with VQ-based unified models.
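The gradient diagnostic in the third bullet can be sketched in a few lines. This is a minimal illustration of the two statistics involved (cosine similarity and norm ratio between flattened per-task gradients); `gradient_conflict_stats` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def gradient_conflict_stats(g_a, g_b):
    """Cosine similarity and norm ratio between two flattened task gradients.

    cos ~ 0 means the tasks pull shared parameters in near-orthogonal
    directions; a large norm ratio means one task's updates dominate.
    """
    g_a = np.asarray(g_a, dtype=float)
    g_b = np.asarray(g_b, dtype=float)
    cos = float(g_a @ g_b) / (np.linalg.norm(g_a) * np.linalg.norm(g_b))
    ratio = float(np.linalg.norm(g_b) / np.linalg.norm(g_a))
    return cos, ratio

# Toy example: orthogonal gradients with a 3x magnitude gap.
cos, ratio = gradient_conflict_stats([1.0, 0.0], [0.0, 3.0])
print(cos, ratio)  # -> 0.0 3.0
```

In the paper's setting the imbalance is attributed to token counts: a per-example loss summed over 576 image tokens versus ~30-100 text tokens naturally yields gradients of very different scale, consistent with the reported ~11-14x gap.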

Abstract

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
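The ln(2) plateau has a simple arithmetic reading: the sigmoid-form DPO loss is -log σ(β · (margin_chosen - margin_rejected)), so if training cannot separate the chosen from the rejected generation the margin stays at zero and the loss sits at -log σ(0) = -log(1/2) = ln 2 ≈ 0.693. A minimal sketch (variable names are illustrative, not from the paper):

```python
import math

def dpo_loss(beta, margin_chosen, margin_rejected):
    """Sigmoid-form DPO loss on the reward margin.

    margin_* are the policy-vs-reference log-probability ratios for the
    chosen and rejected sequences; beta scales the implicit reward.
    """
    logits = beta * (margin_chosen - margin_rejected)
    sigmoid = 1.0 / (1.0 + math.exp(-logits))
    return -math.log(sigmoid)

# Zero margin: the loss is exactly ln(2), the value the paper reports
# the generation DPO loss converging to.
print(dpo_loss(beta=0.1, margin_chosen=0.0, margin_rejected=0.0))  # -> 0.693...
```

A loss pinned at ln(2) means the model is effectively indifferent between chosen and rejected samples, which is why the authors read it as evidence of a structural bottleneck in the discrete VQ token pathway rather than a tuning issue.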