AI Navigate

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

arXiv cs.LG / 3/19/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper systematically studies whether DPO can align both understanding and generation in unified multimodal models (Janus-Pro at 1B and 7B) across seven training strategies and two post-hoc methods, and finds generation quality resists DPO alignment under all tested conditions.
  • Generation CLIPScore does not improve at 7B, and at 1B all methods degrade generation, regardless of preference-data type (real-vs-generated, model-vs-model) or the data volumes tested (150-288 pairs).
  • Gradient analysis shows understanding and generation gradients are near-orthogonal, with an ~11-14x magnitude imbalance driven by VQ token-count asymmetry (576 generation tokens vs. ~30-100 text tokens), making multi-task DPO difficult.
  • The discrete VQ tokenization is identified as a likely structural bottleneck, with the generation DPO loss converging to ln(2); the paper provides practical guidance for practitioners working with VQ-based unified models.
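The gradient diagnostic in the third bullet can be sketched in a few lines. This is a minimal illustration of the two statistics involved (cosine similarity and norm ratio between flattened per-task gradients); `gradient_conflict_stats` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def gradient_conflict_stats(g_a, g_b):
    """Cosine similarity and norm ratio between two flattened task gradients.

    cos ~ 0 means the tasks pull shared parameters in near-orthogonal
    directions; a large norm ratio means one task's updates dominate.
    """
    g_a = np.asarray(g_a, dtype=float)
    g_b = np.asarray(g_b, dtype=float)
    cos = float(g_a @ g_b) / (np.linalg.norm(g_a) * np.linalg.norm(g_b))
    ratio = float(np.linalg.norm(g_b) / np.linalg.norm(g_a))
    return cos, ratio

# Toy example: orthogonal gradients with a 3x magnitude gap.
cos, ratio = gradient_conflict_stats([1.0, 0.0], [0.0, 3.0])
print(cos, ratio)  # -> 0.0 3.0
```

In the paper's setting the imbalance is attributed to token counts: a per-example loss summed over 576 image tokens versus ~30-100 text tokens naturally yields gradients of very different scale, consistent with the reported ~11-14x gap.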

Abstract

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
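The ln(2) plateau has a simple arithmetic reading: the sigmoid-form DPO loss is -log σ(β · (margin_chosen - margin_rejected)), so if training cannot separate the chosen from the rejected generation the margin stays at zero and the loss sits at -log σ(0) = -log(1/2) = ln 2 ≈ 0.693. A minimal sketch (variable names are illustrative, not from the paper):

```python
import math

def dpo_loss(beta, margin_chosen, margin_rejected):
    """Sigmoid-form DPO loss on the reward margin.

    margin_* are the policy-vs-reference log-probability ratios for the
    chosen and rejected sequences; beta scales the implicit reward.
    """
    logits = beta * (margin_chosen - margin_rejected)
    sigmoid = 1.0 / (1.0 + math.exp(-logits))
    return -math.log(sigmoid)

# Zero margin: the loss is exactly ln(2), the value the paper reports
# the generation DPO loss converging to.
print(dpo_loss(beta=0.1, margin_chosen=0.0, margin_rejected=0.0))  # -> 0.693...
```

A loss pinned at ln(2) means the model is effectively indifferent between chosen and rejected samples, which is why the authors read it as evidence of a structural bottleneck in the discrete VQ token pathway rather than a tuning issue.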