Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
arXiv cs.CV / 4/29/2026
📰 News · Models & Research
Key Points
- Unified Multimodal Models (uMMs) can perform well on visual understanding and generation in isolation, but current benchmarks don’t test whether the two capabilities rely on semantically aligned representations.
- The paper introduces XTC-Bench, a scene-graph-grounded evaluation framework that derives both generation prompts and understanding queries from the same structured scene graph to measure cross-task visual semantic consistency.
- It proposes Continuous Cross-Task Agreement (CCTA), a fine-grained metric that compares generation and understanding on matched atomic facts to separate internal consistency from standalone accuracy.
- Experiments across multiple open-source and commercial uMMs show that strong results in either understanding or generation do not necessarily translate into strong cross-task alignment, and architectural unification alone isn’t the main driver of consistency.
- The authors conclude that cross-modal consistency depends on how tightly the learning objectives are coupled across modalities, and they release XTC-Bench as a reproducible, model-agnostic diagnostic tool for representation-level misalignment.
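The paper does not publish CCTA's exact formula in this summary, but the idea of scoring agreement between generation and understanding on matched atomic facts can be sketched minimally. The sketch below is an illustrative assumption, not the authors' implementation: it supposes each atomic fact from the scene graph gets a per-fact score in [0, 1] from the generation side (was the fact realized in the generated image?) and from the understanding side (did the model verify the fact?), and defines agreement as `1 - |g - u|` averaged over facts, so consistency is measured separately from standalone accuracy.

```python
# Illustrative CCTA-style sketch (an assumption, NOT the paper's method):
# agreement between generation-side and understanding-side verdicts on the
# same atomic facts derived from one scene graph.

def ccta(gen_scores, und_scores):
    """Mean per-fact agreement: 1 - |g - u|, averaged over matched facts.

    Both inputs are equal-length lists of scores in [0, 1]. A model that is
    wrong on a fact in BOTH tasks still counts as internally consistent.
    """
    if len(gen_scores) != len(und_scores) or not gen_scores:
        raise ValueError("need equal-length, non-empty score lists")
    return sum(1.0 - abs(g - u) for g, u in zip(gen_scores, und_scores)) / len(gen_scores)

# Hypothetical example: three atomic facts from one scene graph, e.g.
# ("cat", "on", "sofa"), ("sofa", "color", "red"), ("lamp", "left-of", "cat").
gen = [0.9, 0.2, 0.8]   # generation side: was each fact rendered?
und = [1.0, 0.1, 0.3]   # understanding side: was each fact verified?
print(round(ccta(gen, und), 3))  # → 0.767
```

Note how the third fact drags the score down: the two tasks disagree sharply on it even though each might look acceptable in a standalone benchmark, which is exactly the representation-level misalignment XTC-Bench is designed to surface.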