Good Scores, Bad Data: A Metric for Multimodal Coherence

arXiv cs.AI / 3/30/2026

💬 Opinion | Ideas & Deep Analysis | Models & Research

Key Points

  • The paper argues that multimodal AI evaluation based only on downstream accuracy can miss cases where inputs are incoherent, such as contradictory image/question signals still yielding strong VQA results.
  • It introduces the Multimodal Coherence Score (MCS), which measures fusion quality independently of any downstream model.
  • MCS breaks coherence into four independently testable dimensions (identity, spatial, semantic, and decision), with dimension weights learned via Nelder-Mead optimization.
  • Experiments on 1,000 Visual Genome images using DETR, CLIP, and ViLT, with validation on 150 COCO images without retraining, show that across three fusion architectures MCS discriminates fusion quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071).
  • Perturbation tests show that each dimension responds independently to its own failure mode with zero cross-talk. The metric is lightweight and annotation-free, and it identifies which coherence aspect failed.
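
The weighted combination described in the points above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the squared-error objective, the synthetic samples, the reference weights `TRUE_WEIGHTS`, and the helper names (`normalize`, `coherence_score`, `nelder_mead`, `loss`) are all assumptions introduced here, and the simplex search is a compact textbook variant written in pure Python.

```python
from typing import Callable, List, Sequence

DIMENSIONS = ("identity", "spatial", "semantic", "decision")

# Synthetic per-sample dimension scores in [0, 1] (assumed, not from the paper).
# Each row has one intact dimension and three degraded ones.
SAMPLES = [
    (0.9, 0.1, 0.1, 0.1),
    (0.1, 0.9, 0.1, 0.1),
    (0.1, 0.1, 0.9, 0.1),
    (0.1, 0.1, 0.1, 0.9),
]
# Hypothetical reference weights used only to generate toy quality targets.
TRUE_WEIGHTS = (0.4, 0.3, 0.2, 0.1)
QUALITY = [sum(w * s for w, s in zip(TRUE_WEIGHTS, row)) for row in SAMPLES]


def normalize(raw: Sequence[float]) -> List[float]:
    """Map unconstrained parameters to non-negative weights that sum to 1."""
    mags = [abs(r) for r in raw]
    total = sum(mags)
    if total == 0.0:
        return [1.0 / len(mags)] * len(mags)
    return [m / total for m in mags]


def coherence_score(dim_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the four dimension scores."""
    return sum(w * s for w, s in zip(weights, dim_scores))


def loss(raw: Sequence[float]) -> float:
    """Assumed objective: mean squared error against the toy quality targets."""
    weights = normalize(raw)
    errors = [coherence_score(row, weights) - q for row, q in zip(SAMPLES, QUALITY)]
    return sum(e * e for e in errors) / len(errors)


def nelder_mead(
    f: Callable[[Sequence[float]], float],
    x0: Sequence[float],
    step: float = 0.25,
    iters: int = 400,
) -> List[float]:
    """Minimal Nelder-Mead search: reflect, expand, contract, or shrink."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex offset along each axis.
    simplex = [list(x0)] + [
        [x0[j] + (step if j == i else 0.0) for j in range(n)] for i in range(n)
    ]
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]

        def along(t: float) -> List[float]:
            # Point on the line through the centroid, away from the worst vertex.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            expanded = along(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(-0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best one.
                simplex = [best] + [
                    [(b + p) / 2 for b, p in zip(best, pt)] for pt in simplex[1:]
                ]
    return min(simplex, key=f)


start = [1.0, 1.0, 1.0, 1.0]  # uniform weights after normalization
fitted_raw = nelder_mead(loss, start)
weights = normalize(fitted_raw)
print(dict(zip(DIMENSIONS, (round(w, 3) for w in weights))))
```

Normalizing inside the objective keeps the search unconstrained while the reported weights stay non-negative and sum to 1. With real data, `SAMPLES` and `QUALITY` would be replaced by measured dimension scores and whatever reference signal the weights are fit against.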

Abstract

Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independent of any downstream model. MCS decomposes coherence into four dimensions (identity, spatial, semantic, and decision), with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight, requires no human annotation, and tells you not just that something broke, but what broke.
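
The two evaluation tools named in the abstract, Spearman rank correlation and a per-dimension perturbation check, can be illustrated with a short pure-Python sketch. The helper names (`ranks`, `spearman_rho`, `cross_talk`), the tolerance, and the toy scores are assumptions made for illustration; the paper's own evaluation code and data are not reproduced here.

```python
from typing import Dict, List, Sequence

DIMENSIONS = ("identity", "spatial", "semantic", "decision")


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def cross_talk(
    baseline: Dict[str, float],
    perturbed: Dict[str, Dict[str, float]],
    tol: float = 1e-9,
) -> Dict[str, List[str]]:
    """For each perturbed dimension, list the other dimensions whose score moved."""
    return {
        target: [
            d for d in DIMENSIONS
            if d != target and abs(scores[d] - baseline[d]) > tol
        ]
        for target, scores in perturbed.items()
    }


# Toy dimension scores (assumed): each perturbation lowers only its own dimension.
baseline = {d: 0.9 for d in DIMENSIONS}
perturbed = {
    "spatial": {**baseline, "spatial": 0.3},
    "semantic": {**baseline, "semantic": 0.2},
}
leaks = cross_talk(baseline, perturbed)
print(leaks)
```

An empty list for every perturbed dimension corresponds to the zero cross-talk case: only the targeted dimension's score changed. Comparing `spearman_rho(metric_scores, fusion_quality)` for MCS and for task accuracy would mirror the sensitivity comparison described above, given suitable per-configuration scores.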