OMHBench: Benchmarking Balanced and Grounded Omni-Modal Multi-Hop Reasoning

arXiv cs.CL / 4/29/2026

💬 Opinion · Models & Research

Key Points

  • The paper introduces OMHBench, a new benchmark (6,144 questions) built to test omni-modal multi-hop reasoning across text, vision, and speech with balanced, jointly grounded reasoning paths.
  • It argues that existing MLLM evaluation frameworks are flawed because they allow modality shortcuts and biased reasoning trajectories.
  • Evaluations of 13 state-of-the-art MLLMs show a substantial performance gap between proprietary and open-source models.
  • The study finds proprietary models are still highly sensitive to how reasoning paths vary, leading to uneven grounding across modalities.
  • Models struggle most with the speech modality, underscoring the need for balanced omni-modal, multi-hop evaluation rather than text- and vision-only testing.

Abstract

Multimodal Large Language Models (MLLMs) increasingly support omni-modal processing across text, vision, and speech. However, existing evaluation frameworks for such models suffer from critical limitations, including modality shortcuts and biased reasoning paths. To address these challenges, we propose OMHBench, a novel benchmark designed to rigorously evaluate omni-modal multi-hop reasoning. It consists of 6,144 questions with balanced reasoning paths that are jointly grounded across all three modalities. Extensive evaluation of 13 state-of-the-art models reveals that (1) a large performance gap exists between proprietary and open-source MLLMs and (2) even proprietary models exhibit high sensitivity to reasoning path variations, resulting in asymmetric omni-modal grounding. Notably, models struggle when processing the speech modality, underscoring the need for balanced, multi-hop evaluation of omni-modal intelligence.
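
To make the benchmark's design concrete: each OMHBench item pairs a question with a multi-hop reasoning path whose hops are grounded in different modalities, and the reported sensitivity to path variations can be read off a per-path accuracy breakdown. The sketch below is purely illustrative and assumes a hypothetical item layout; the OmniHopItem fields, the accuracy_by_path helper, and the toy data are not taken from the paper.

    from dataclasses import dataclass
    from collections import defaultdict

    # Hypothetical item layout: one multi-hop question whose evidence ("hops")
    # is spread across text, image, and speech inputs. Field names are assumptions.
    @dataclass
    class OmniHopItem:
        question: str
        hop_modalities: list   # reasoning path, e.g. ["vision", "speech", "text"]
        answer: str

    def accuracy_by_path(items, predictions):
        """Exact-match accuracy grouped by reasoning-path signature, so
        sensitivity to path variations becomes visible in the breakdown."""
        correct, total = defaultdict(int), defaultdict(int)
        for item, pred in zip(items, predictions):
            path = "->".join(item.hop_modalities)
            total[path] += 1
            correct[path] += int(pred.strip().lower() == item.answer.strip().lower())
        return {p: correct[p] / total[p] for p in total}

    # Toy usage with made-up examples (not OMHBench data).
    items = [
        OmniHopItem("Which city is named in the audio clip playing beside the poster?",
                    ["vision", "speech", "text"], "Lisbon"),
        OmniHopItem("What year does the narrator attribute to the pictured event?",
                    ["speech", "vision", "text"], "1969"),
    ]
    print(accuracy_by_path(items, ["Lisbon", "1955"]))
    # {'vision->speech->text': 1.0, 'speech->vision->text': 0.0}

A breakdown of this shape is one simple way to surface the asymmetric grounding the authors describe: a model that is strong on vision-first paths but weak on speech-first paths would show it directly in the per-path scores.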