Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models
arXiv cs.LG / 3/19/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper systematically studies whether DPO can align both understanding and generation in unified multimodal models (Janus-Pro at 1B and 7B) across seven training strategies and two post-hoc methods, finding that generation quality resists DPO alignment under all tested conditions.
- Generation CLIPScore does not improve at 7B, and at 1B every method degrades generation, regardless of preference-data type (real-vs-generated or model-vs-model) or data volume (150-288 pairs).
- Gradient analysis shows that understanding and generation gradients are near-orthogonal, with a large magnitude imbalance driven by VQ token counts (~576 generation tokens vs. ~30-100 text tokens), which makes multi-task DPO difficult.
- The discrete VQ tokenization is identified as a likely structural bottleneck, with the generation DPO loss converging to ln(2); the paper provides practical guidance for practitioners working with VQ-based unified models.
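The ln(2) plateau has a simple interpretation: the per-pair DPO loss is -log sigmoid of the scaled reward margin, so when the policy never separates chosen from rejected (margin stays zero relative to the reference), the loss sits exactly at ln 2 ≈ 0.693. A minimal sketch, using the standard DPO loss form (beta value and token counts below are illustrative, taken from the summary's figures):

```python
import math

def dpo_pair_loss(beta: float, policy_margin: float, ref_margin: float) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)),
    where a margin is log p(chosen) - log p(rejected)."""
    z = beta * (policy_margin - ref_margin)
    return -math.log(1.0 / (1.0 + math.exp(-z)))

# If the policy's preference margin never moves away from the reference
# model's, the loss is pinned at ln(2) -- the plateau the paper reports
# for the generation branch.
stuck = dpo_pair_loss(beta=0.1, policy_margin=0.0, ref_margin=0.0)
print(f"loss = {stuck:.4f}, ln(2) = {math.log(2):.4f}")

# Token-count imbalance: ~576 VQ image tokens per sample vs ~30-100 text
# tokens means the generation loss sums roughly 6x-19x more per-token terms,
# skewing gradient magnitudes before any task weighting is applied.
print(576 / 100, 576 / 30)
```

A flat loss at ln 2 therefore signals that the sequence-level log-likelihood ratios over the discrete VQ tokens carry no usable preference signal, consistent with the paper's tokenization-bottleneck reading.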