Omni-Modal Dissonance Benchmark: Systematically Breaking Modality Consensus to Probe Robustness and Calibrated Abstention
arXiv cs.LG / 3/31/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces OMD-Bench to disentangle modality reliance in omni-modal systems by starting with fully congruent video, audio, and text anchors and then applying systematic corruption per modality.
- It directly targets a key confound in existing omni-modal benchmarks: modalities are often naturally co-occurring with correlated but unequal information, so measured “modality contributions” can be misleading.
- OMD-Bench includes 4,080 instances over 27 anchors across eight corruption conditions, and it evaluates “calibrated abstention” to test whether models correctly refrain when evidence conflicts.
- Experiments on ten omni-modal models (zero-shot and chain-of-thought prompting) show a tendency to over-abstain when two modalities are corrupted but to under-abstain severely when all three are, while still reporting high stated confidence (~60–100%) under full corruption.
- Chain-of-thought prompting improves alignment of abstention with human judgment, but it increases overconfidence rather than resolving calibration issues, making the benchmark a diagnostic tool for robustness and uncertainty calibration.
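The abstention analysis above can be illustrated with a small sketch: given per-instance model outputs and human judgments, compute the abstention rate, the human-judged abstention rate, and the mean stated confidence for each corruption condition. The record fields and condition names are hypothetical, not from the paper.

```python
from collections import defaultdict

def abstention_report(records):
    """Aggregate abstention behavior per corruption condition.

    `records` is a list of dicts with illustrative keys:
      'condition'    - corruption condition label (e.g. 'all_corrupted')
      'abstained'    - bool, did the model refuse to answer
      'gold_abstain' - bool, would a human judge abstention correct
      'confidence'   - float in [0, 1], model's stated confidence
    """
    stats = defaultdict(lambda: {"n": 0, "abstained": 0, "gold": 0, "conf": 0.0})
    for r in records:
        s = stats[r["condition"]]
        s["n"] += 1
        s["abstained"] += r["abstained"]
        s["gold"] += r["gold_abstain"]
        s["conf"] += r["confidence"]
    return {
        cond: {
            "abstain_rate": s["abstained"] / s["n"],
            "gold_abstain_rate": s["gold"] / s["n"],
            "mean_confidence": s["conf"] / s["n"],
        }
        for cond, s in stats.items()
    }
```

Comparing `abstain_rate` against `gold_abstain_rate` per condition surfaces the over-/under-abstention pattern the paper reports, and a high `mean_confidence` alongside a low `abstain_rate` under full corruption flags miscalibration.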