Omni-Modal Dissonance Benchmark: Systematically Breaking Modality Consensus to Probe Robustness and Calibrated Abstention

arXiv cs.LG / 3/31/2026

Key Points

  • The paper introduces OMD-Bench to disentangle modality reliance in omni-modal systems by starting with fully congruent video, audio, and text anchors and then applying systematic corruption per modality.
  • It directly targets a key confound in existing omni-modal benchmarks: naturally co-occurring modalities carry correlated but unequal information, so measured “modality contributions” can reflect information asymmetry rather than true modality reliance.
  • OMD-Bench comprises 4,080 instances built from 27 anchors across eight corruption conditions (see the sketch after this list), and it evaluates “calibrated abstention”: whether models correctly refrain from answering when evidence conflicts.
  • Experiments on ten omni-modal models (under zero-shot and chain-of-thought prompting) show that models tend to over-abstain when two modalities are corrupted but under-abstain severely when all three are, while still reporting high confidence (~60–100%) under full corruption.
  • Chain-of-thought prompting improves the alignment of abstention with human judgment but amplifies overconfidence rather than resolving it, making OMD-Bench a diagnostic tool for robustness and uncertainty calibration.
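
The summary does not enumerate the eight corruption conditions, but with three modalities the natural reading is one condition per subset of {video, audio, text} (2³ = 8, from fully congruent to fully corrupted). A minimal sketch under that assumption:

```python
from itertools import combinations

MODALITIES = ("video", "audio", "text")

def corruption_conditions():
    """All 2**3 = 8 corruption patterns over the three modalities,
    from the fully congruent anchor (no corruption) to full corruption."""
    return [
        frozenset(corrupted)
        for k in range(len(MODALITIES) + 1)
        for corrupted in combinations(MODALITIES, k)
    ]

for cond in corruption_conditions():
    print("+".join(sorted(cond)) or "congruent")
```

If that reading is right, 27 anchors × 8 conditions yield only 216 anchor-condition cells, so the 4,080 instances must include multiple items per cell; the summary does not say how they are distributed.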

Abstract

Existing omni-modal benchmarks attempt to measure modality-specific contributions, but their measurements are confounded: naturally co-occurring modalities carry correlated yet unequal information, making it unclear whether results reflect true modality reliance or information asymmetry. We introduce OMD-Bench, in which all modalities are initially congruent (each presenting the same anchor, an object or event independently perceivable through video, audio, and text) and are then systematically corrupted to isolate each modality's contribution. We also evaluate calibrated abstention: whether models appropriately refrain from answering when evidence is conflicting. The benchmark comprises 4,080 instances spanning 27 anchors across eight corruption conditions. Evaluating ten omni-modal models under zero-shot and chain-of-thought prompting, we find that models over-abstain when two modalities are corrupted yet under-abstain severely when all three are, while maintaining high confidence (~60–100%) even under full corruption. Chain-of-thought prompting improves abstention alignment with human judgment but amplifies overconfidence rather than mitigating it. OMD-Bench thus serves as a diagnostic benchmark for modality reliance, robustness to cross-modal inconsistency, and uncertainty calibration in omni-modal systems.
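
The abstention and confidence findings suggest a simple per-condition evaluation loop. The record fields and the aggregation below are illustrative assumptions, not the paper's protocol: a hypothetical harness that, for each corruption condition, measures how often a model abstains and its mean stated confidence, so over- and under-abstention (and residual confidence under full corruption) show up directly.

```python
from collections import defaultdict
from statistics import mean

def summarize(records):
    """Aggregate model outputs per corruption condition.

    `records` is an iterable of dicts with illustrative keys:
      condition:  which modalities were corrupted, e.g. "video+audio"
      abstained:  bool, whether the model refused to answer
      confidence: float in [0, 1], the model's stated confidence
    """
    by_condition = defaultdict(list)
    for r in records:
        by_condition[r["condition"]].append(r)

    for condition, rows in sorted(by_condition.items()):
        abstain_rate = mean(r["abstained"] for r in rows)
        avg_conf = mean(r["confidence"] for r in rows)
        print(f"{condition:20s} abstain={abstain_rate:.2f} conf={avg_conf:.2f}")

# Toy usage mirroring the reported pattern: under-abstention with
# high stated confidence once all three modalities are corrupted.
summarize([
    {"condition": "video+audio", "abstained": True, "confidence": 0.40},
    {"condition": "video+audio+text", "abstained": False, "confidence": 0.85},
    {"condition": "video+audio+text", "abstained": False, "confidence": 0.95},
])
```

A harness of this shape makes the paper's headline failure mode legible at a glance: a well-calibrated model should abstain more, not less, as the fully corrupted condition removes every reliable evidence source.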