Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

arXiv cs.CV · April 29, 2026

📰 News · Models & Research

Key Points

  • Unified Multimodal Models (uMMs) may achieve good performance on visual understanding and generation independently, but current benchmarks don’t test whether the two capabilities produce semantically aligned representations.
  • The paper introduces XTC-Bench, a scene-graph-grounded evaluation framework that derives both generation prompts and understanding queries from the same structured scene graph to measure cross-task visual semantic consistency.
  • It proposes Continuous Cross-Task Agreement (CCTA), a fine-grained metric that compares generation and understanding on matched atomic facts to separate internal consistency from standalone accuracy.
  • Experiments across eight open-source uMMs and one commercial uMM show that strong results in either understanding or generation do not necessarily translate into strong cross-task alignment, and architectural unification alone isn’t the main driver of consistency.
  • The authors conclude that cross-modal consistency depends on how tightly the learning objectives are coupled across modalities, and they release XTC-Bench as a reproducible, model-agnostic diagnostic tool for representation-level misalignment.
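To make the evaluation idea concrete, here is a minimal sketch of the pipeline the key points describe: flattening a scene graph into atomic facts (objects, attributes, relations) and aggregating per-fact agreement between the generation and understanding sides into a single score. The paper does not give the CCTA formula here, so the fact encoding, the per-fact agreement rule (1 − |gen − und| over continuous scores in [0, 1]), and all function names are assumptions for illustration only.

```python
# Illustrative sketch, NOT the paper's implementation: derive matched atomic
# facts from a scene graph, then compute a CCTA-style agreement score.

def atomic_facts(scene_graph):
    """Flatten a scene graph into atomic facts: objects, attributes, relations."""
    facts = []
    for obj in scene_graph["objects"]:
        facts.append(("object", obj["name"]))
        for attr in obj.get("attributes", []):
            facts.append(("attribute", obj["name"], attr))
    for subj, rel, obj in scene_graph.get("relations", []):
        facts.append(("relation", subj, rel, obj))
    return facts

def ccta(gen_scores, und_scores):
    """Assumed aggregation: mean per-fact agreement, where agreement on one
    fact is 1 - |gen - und| for continuous scores in [0, 1]."""
    assert gen_scores.keys() == und_scores.keys()
    return sum(1 - abs(gen_scores[f] - und_scores[f]) for f in gen_scores) / len(gen_scores)

scene_graph = {
    "objects": [{"name": "dog", "attributes": ["brown"]},
                {"name": "ball", "attributes": []}],
    "relations": [("dog", "chasing", "ball")],
}
facts = atomic_facts(scene_graph)
# Hypothetical per-fact scores: how well the generated image realizes each
# fact (gen) vs. how confidently the model answers a query about it (und).
gen = dict(zip(facts, [0.9, 0.8, 1.0, 0.7]))
und = dict(zip(facts, [0.9, 0.4, 1.0, 0.7]))
print(round(ccta(gen, und), 2))  # disagreement on "brown" lowers the score
```

The point of grounding both sides in the same scene graph is that each generation prompt and each understanding query target the *same* atomic fact, so any gap between the two columns of scores is attributable to representation-level misalignment rather than to differing test content.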

Abstract

Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks for a given visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source and one commercial unified model reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.