Omni-Modal Dissonance Benchmark: Systematically Breaking Modality Consensus to Probe Robustness and Calibrated Abstention

arXiv cs.LG / 3/31/2026

Key Points

  • The paper introduces OMD-Bench to disentangle modality reliance in omni-modal systems by starting with fully congruent video, audio, and text anchors and then applying systematic corruption per modality.
  • It directly targets a key confound in existing omni-modal benchmarks: naturally co-occurring modalities carry correlated but unequal information, so measured “modality contributions” can reflect information asymmetry rather than true modality reliance.
  • OMD-Bench comprises 4,080 instances built from 27 anchors across eight corruption conditions (see the sketch after this list), and it evaluates “calibrated abstention”: whether models correctly refrain from answering when evidence conflicts.
  • Experiments on ten omni-modal models (under zero-shot and chain-of-thought prompting) show that models tend to over-abstain when two modalities are corrupted but under-abstain severely when all three are, while still reporting high confidence (~60–100%) under full corruption.
  • Chain-of-thought prompting improves the alignment of abstention with human judgment but amplifies overconfidence rather than resolving it, making OMD-Bench a diagnostic tool for robustness and uncertainty calibration.
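
The summary does not enumerate the eight corruption conditions, but with three modalities the natural reading is one condition per subset of {video, audio, text} (2³ = 8, from fully congruent to fully corrupted). A minimal sketch under that assumption:

```python
from itertools import combinations

MODALITIES = ("video", "audio", "text")

def corruption_conditions():
    """All 2**3 = 8 corruption patterns over the three modalities,
    from the fully congruent anchor (no corruption) to full corruption."""
    return [
        frozenset(corrupted)
        for k in range(len(MODALITIES) + 1)
        for corrupted in combinations(MODALITIES, k)
    ]

for cond in corruption_conditions():
    print("+".join(sorted(cond)) or "congruent")
```

If that reading is right, 27 anchors × 8 conditions yield only 216 anchor-condition cells, so the 4,080 instances must include multiple items per cell; the summary does not say how they are distributed.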

Abstract

Existing omni-modal benchmarks attempt to measure modality-specific contributions, but their measurements are confounded: naturally co-occurring modalities carry correlated yet unequal information, making it unclear whether results reflect true modality reliance or information asymmetry. We introduce OMD-Bench, in which all modalities are initially congruent (each presenting the same anchor, an object or event independently perceivable through video, audio, and text) and are then systematically corrupted to isolate each modality's contribution. We also evaluate calibrated abstention: whether models appropriately refrain from answering when evidence is conflicting. The benchmark comprises 4,080 instances spanning 27 anchors across eight corruption conditions. Evaluating ten omni-modal models under zero-shot and chain-of-thought prompting, we find that models over-abstain when two modalities are corrupted yet under-abstain severely when all three are, while maintaining high confidence (~60–100%) even under full corruption. Chain-of-thought prompting improves abstention alignment with human judgment but amplifies overconfidence rather than mitigating it. OMD-Bench thus serves as a diagnostic benchmark for modality reliance, robustness to cross-modal inconsistency, and uncertainty calibration in omni-modal systems.
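
The abstention and confidence findings suggest a simple per-condition evaluation loop. The record fields and the aggregation below are illustrative assumptions, not the paper's protocol: a hypothetical harness that, for each corruption condition, measures how often a model abstains and its mean stated confidence, so over- and under-abstention (and residual confidence under full corruption) show up directly.

```python
from collections import defaultdict
from statistics import mean

def summarize(records):
    """Aggregate model outputs per corruption condition.

    `records` is an iterable of dicts with illustrative keys:
      condition:  which modalities were corrupted, e.g. "video+audio"
      abstained:  bool, whether the model refused to answer
      confidence: float in [0, 1], the model's stated confidence
    """
    by_condition = defaultdict(list)
    for r in records:
        by_condition[r["condition"]].append(r)

    for condition, rows in sorted(by_condition.items()):
        abstain_rate = mean(r["abstained"] for r in rows)
        avg_conf = mean(r["confidence"] for r in rows)
        print(f"{condition:20s} abstain={abstain_rate:.2f} conf={avg_conf:.2f}")

# Toy usage mirroring the reported pattern: under-abstention with
# high stated confidence once all three modalities are corrupted.
summarize([
    {"condition": "video+audio", "abstained": True, "confidence": 0.40},
    {"condition": "video+audio+text", "abstained": False, "confidence": 0.85},
    {"condition": "video+audio+text", "abstained": False, "confidence": 0.95},
])
```

A harness of this shape makes the paper's headline failure mode legible at a glance: a well-calibrated model should abstain more, not less, as the fully corrupted condition removes every reliable evidence source.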