A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

arXiv cs.CL / 4/1/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes a model-agnostic framework that uses partial information decomposition (PID) to quantify how much of an LVLM's decision-relevant information is redundant, unique, or synergistic across modalities.
  • It profiles 26 LVLMs on four datasets with a scalable estimator, analyzing the “information spectrum” along three dimensions: breadth (cross-model and cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training).
  • The study identifies two task regimes—synergy-driven versus knowledge-driven—and two stable family-level strategies—fusion-centric versus language-centric—in how LVLMs form answers.
  • It finds a consistent three-phase pattern in layer-wise processing and concludes that visual instruction tuning is the key stage where multimodal fusion is learned.
  • The authors argue this quantitative approach extends beyond accuracy-only evaluation and can inform the analysis and design of next-generation LVLMs, with code/data provided in a public repository.

Abstract

Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine whether their success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs -- decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions -- breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis.
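To make the redundant/unique/synergistic split concrete, here is a minimal sketch of a classic PID, the Williams–Beer decomposition with the I_min redundancy measure, for two discrete sources and a target. This is an illustrative toy, not the paper's estimator: the paper adapts a scalable estimator to LVLM outputs, whereas this version enumerates an explicit joint distribution. The function name `pid_min` and the example distributions are assumptions made for illustration.

```python
import math

def pid_min(joint):
    """Williams-Beer PID with I_min redundancy, for two sources X1, X2
    and a target Y. `joint` maps (x1, x2, y) -> probability (> 0)."""
    def marg(idxs):  # marginal distribution over the selected positions
        m = {}
        for key, p in joint.items():
            k = tuple(key[i] for i in idxs)
            m[k] = m.get(k, 0.0) + p
        return m

    p_y = marg((2,))
    p_x1, p_x2, p_x12 = marg((0,)), marg((1,)), marg((0, 1))
    p_x1y, p_x2y = marg((0, 2)), marg((1, 2))

    # Specific information I(Y=y; Xi): how much source Xi tells us
    # about the particular outcome y. The building block of I_min.
    def spec(p_xiy, p_xi, y):
        total = 0.0
        for (xi, yy), p in p_xiy.items():
            if yy != y:
                continue
            p_xi_given_y = p / p_y[(y,)]
            p_y_given_xi = p / p_xi[(xi,)]
            total += p_xi_given_y * math.log2(p_y_given_xi / p_y[(y,)])
        return total

    # Redundancy: expected minimum specific information over the sources.
    red = sum(p * min(spec(p_x1y, p_x1, y), spec(p_x2y, p_x2, y))
              for (y,), p in p_y.items())

    def mi(p_xy, p_x):  # mutual information I(X; Y) in bits
        return sum(p * math.log2(p / (p_x[k[:-1]] * p_y[k[-1:]]))
                   for k, p in p_xy.items())

    uniq1 = mi(p_x1y, p_x1) - red
    uniq2 = mi(p_x2y, p_x2) - red
    # Synergy is what the joint (X1, X2) adds beyond the other three parts.
    syn = mi(joint, p_x12) - red - uniq1 - uniq2
    return red, uniq1, uniq2, syn

# XOR target: neither source alone is informative -> pure synergy (1 bit).
xor = {(0, 0, 0): .25, (0, 1, 1): .25, (1, 0, 1): .25, (1, 1, 0): .25}
# Copied target: both sources carry the same bit -> pure redundancy (1 bit).
dup = {(0, 0, 0): .5, (1, 1, 1): .5}
```

In the paper's framing, the XOR-like case corresponds to synergy-driven tasks (the answer requires fusing vision and language), while the copy-like case corresponds to redundancy, where either modality alone would suffice.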