Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

arXiv cs.CV · April 29, 2026

📰 News · Models & Research

Key Points

  • Unified Multimodal Models (uMMs) may achieve good performance on visual understanding and generation independently, but current benchmarks don’t test whether the two capabilities produce semantically aligned representations.
  • The paper introduces XTC-Bench, a scene-graph-grounded evaluation framework that derives both generation prompts and understanding queries from the same structured scene graph to measure cross-task visual semantic consistency.
  • It proposes Continuous Cross-Task Agreement (CCTA), a fine-grained metric that compares generation and understanding on matched atomic facts to separate internal consistency from standalone accuracy.
  • Experiments across eight open-source uMMs and one commercial uMM show that strong results in either understanding or generation do not necessarily translate into strong cross-task alignment, and architectural unification alone isn’t the main driver of consistency.
  • The authors conclude that cross-modal consistency depends on how tightly the learning objectives are coupled across modalities, and they release XTC-Bench as a reproducible, model-agnostic diagnostic tool for representation-level misalignment.
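To make the evaluation idea concrete, here is a minimal sketch of the pipeline the key points describe: flattening a scene graph into atomic facts (objects, attributes, relations) and aggregating per-fact agreement between the generation and understanding sides into a single score. The paper does not give the CCTA formula here, so the fact encoding, the per-fact agreement rule (1 − |gen − und| over continuous scores in [0, 1]), and all function names are assumptions for illustration only.

```python
# Illustrative sketch, NOT the paper's implementation: derive matched atomic
# facts from a scene graph, then compute a CCTA-style agreement score.

def atomic_facts(scene_graph):
    """Flatten a scene graph into atomic facts: objects, attributes, relations."""
    facts = []
    for obj in scene_graph["objects"]:
        facts.append(("object", obj["name"]))
        for attr in obj.get("attributes", []):
            facts.append(("attribute", obj["name"], attr))
    for subj, rel, obj in scene_graph.get("relations", []):
        facts.append(("relation", subj, rel, obj))
    return facts

def ccta(gen_scores, und_scores):
    """Assumed aggregation: mean per-fact agreement, where agreement on one
    fact is 1 - |gen - und| for continuous scores in [0, 1]."""
    assert gen_scores.keys() == und_scores.keys()
    return sum(1 - abs(gen_scores[f] - und_scores[f]) for f in gen_scores) / len(gen_scores)

scene_graph = {
    "objects": [{"name": "dog", "attributes": ["brown"]},
                {"name": "ball", "attributes": []}],
    "relations": [("dog", "chasing", "ball")],
}
facts = atomic_facts(scene_graph)
# Hypothetical per-fact scores: how well the generated image realizes each
# fact (gen) vs. how confidently the model answers a query about it (und).
gen = dict(zip(facts, [0.9, 0.8, 1.0, 0.7]))
und = dict(zip(facts, [0.9, 0.4, 1.0, 0.7]))
print(round(ccta(gen, und), 2))  # disagreement on "brown" lowers the score
```

The point of grounding both sides in the same scene graph is that each generation prompt and each understanding query target the *same* atomic fact, so any gap between the two columns of scores is attributable to representation-level misalignment rather than to differing test content.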

Abstract

Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks for a given visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source and one commercial unified model reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.