Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

arXiv cs.AI · April 21, 2026


Key Points

  • The paper studies a previously underexplored issue in native omni-modal LLMs: how the models develop a “modality preference” when text and vision are both available.
  • It introduces a conflict-based benchmark and a modality selection rate metric to systematically quantify modality preference across ten representative OLLMs.
  • The results show a shift from traditional VLM “text-dominance” to a strong visual preference in most omni-modal LLMs.
  • Layer-wise probing indicates the preference is not fixed from the start, but gradually emerges in the mid-to-late layers.
  • The authors leverage these internal signals to diagnose cross-modal hallucinations, reporting competitive results on three downstream multimodal benchmarks without task-specific data.
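The modality selection rate can be illustrated with a small sketch: on a conflict-based benchmark, each example pairs a textual claim with a contradicting visual claim, and the metric is the fraction of model answers that side with each modality. The data layout and function below are illustrative assumptions, not the paper's actual implementation.

```python
def modality_selection_rate(records):
    """Compute the fraction of answers that follow the textual vs. visual claim.

    records: list of dicts with keys 'answer' (model output),
    'text_answer' (claim in the text), 'vision_answer' (claim in the image).
    Field names are hypothetical, chosen for this sketch.
    """
    total = len(records)
    text_hits = sum(r["answer"] == r["text_answer"] for r in records)
    vision_hits = sum(r["answer"] == r["vision_answer"] for r in records)
    return {"text_rate": text_hits / total, "vision_rate": vision_hits / total}

# Toy conflict set: the caption asserts one answer, the image shows another.
records = [
    {"answer": "red", "text_answer": "blue",  "vision_answer": "red"},
    {"answer": "cat", "text_answer": "cat",   "vision_answer": "dog"},
    {"answer": "two", "text_answer": "three", "vision_answer": "two"},
]
rates = modality_selection_rate(records)
# Here the model sides with vision on 2 of 3 conflicts (vision_rate ≈ 0.67),
# the pattern the paper describes as a pronounced visual preference.
```

A high `vision_rate` under conflict is exactly the "visual preference" finding the paper reports for most OLLMs, in contrast to the text-dominance of traditional VLMs.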

Abstract

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify the modality preference of OLLMs using a newly curated conflict-based benchmark and a modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multimodal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference