AI Navigate

MIBench: Evaluating LMMs on Multimodal Interaction

arXiv cs.CV / 3/17/2026


Key Points

  • MIBench is introduced as a comprehensive benchmark for evaluating multimodal interaction in Large Multimodal Models (LMMs), formulating each instance as a (con_v, con_t, task) triplet that pairs a vision context with a text context, so that completing the task requires the appropriate form of multimodal interaction (see the sketch after this list).
  • It assesses three interaction capabilities—sourcing information from vision-centric cues, sourcing from text-centric cues, and generating new information from joint synergy—across three cognitive levels: Recognition, Understanding, and Reasoning.
  • The benchmark comprises over 10,000 vision-text context pairs spanning 32 tasks. Evaluations show that state-of-the-art LMMs remain constrained in multimodal interaction, are easily distracted by textual input when processing visual information, and show only limited multimodal synergy; natively trained multimodal models exhibit deficits even in fundamental interaction ability.
  • The authors anticipate that MIBench will serve as a reference for developing more capable LMMs and for guiding research toward enhanced multimodal interaction.
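
The (con_v, con_t, task) triplet formulation lends itself to a simple data layout. Below is a minimal Python sketch of what a MIBench-style instance might look like; the class, field names, and the capability/level enums are illustrative assumptions based on this summary, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical tags mirroring the paper's three interaction
# capabilities and three cognitive levels (names are assumptions).
class Capability(Enum):
    VISION_CENTRIC = "source information from vision-centric cues"
    TEXT_CENTRIC = "source information from text-centric cues"
    JOINT_SYNERGY = "generate new information from joint synergy"

class CognitiveLevel(Enum):
    RECOGNITION = "Recognition"
    UNDERSTANDING = "Understanding"
    REASONING = "Reasoning"

@dataclass
class MIBenchInstance:
    """One (con_v, con_t, task) triplet, as described in the paper."""
    con_v: str                # vision context, e.g. an image path or URL
    con_t: str                # accompanying text context
    task: str                 # the task prompt the LMM must complete
    capability: Capability    # which interaction capability is probed
    level: CognitiveLevel     # which cognitive level is probed
    answer: str               # gold answer used for scoring
```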

Abstract

Different multimodal scenarios require integrating and utilizing information across modalities in ways specific to the demands of the task. These different ways of integrating modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs). It formulates each instance as a (con_v, con_t, task) triplet with contexts from vision and text, requiring LMMs to employ the correct form of multimodal interaction to complete the task effectively. MIBench assesses models from three key aspects: the ability to source information from vision-centric cues, the ability to source information from text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluations of state-of-the-art LMMs show that: (1) LMMs' ability in multimodal interaction remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by textual input when processing visual information; (3) they mostly possess only a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. We expect these observations to serve as a reference for developing LMMs with enhanced multimodal ability in the future.
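
To make the hierarchical evaluation concrete, here is a hedged sketch of how per-capability, per-level accuracy could be aggregated. It reuses the hypothetical MIBenchInstance from the earlier sketch; the model.predict method and the exact-match scoring rule are placeholder assumptions, not the paper's actual evaluation protocol.

```python
from collections import defaultdict

def evaluate(model, instances):
    """Aggregate accuracy for each (capability, cognitive level) cell.

    `model` is any object exposing a hypothetical
    predict(con_v, con_t, task) -> str method; exact-match scoring
    against the gold answer is an illustrative simplification.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        pred = model.predict(inst.con_v, inst.con_t, inst.task)
        cell = (inst.capability, inst.level)
        total[cell] += 1
        correct[cell] += int(pred.strip() == inst.answer.strip())
    # Accuracy per cell, e.g. (JOINT_SYNERGY, REASONING) -> 0.42
    return {cell: correct[cell] / total[cell] for cell in total}
```

Grouping results this way yields the 3x3 capability-by-level grid the abstract describes, which is what exposes gaps such as strong recognition-level sourcing alongside weak reasoning-level synergy.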