AI Navigate

Feature-level Interaction Explanations in Multimodal Transformers

arXiv cs.LG / 3/17/2026


Key Points

  • Introduces Feature-level I2MoE (FL-I2MoE), a structured Mixture-of-Experts layer that operates on token/patch sequences from frozen pretrained encoders to separate unique, synergistic, and redundant evidence at the feature level.
  • Proposes an expert-wise explanation pipeline combining attribution with top-K% masking to assess faithfulness, and introduces Monte Carlo interaction probes including the Shapley Interaction Index (SII) and a redundancy-gap score to quantify cross-modal interactions.
  • Demonstrates on MMIMDb, ENRICO, and MMHS150K that FL-I2MoE yields more interaction-specific and concentrated importance patterns than a dense Transformer with the same encoders.
  • Provides causal evidence that removing pairs ranked by SII or redundancy-gap degrades performance more than random masking under the same budget, suggesting the identified interactions are causally relevant.
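The Shapley Interaction Index mentioned above can be estimated by Monte Carlo sampling of coalitions. The sketch below is a minimal illustration, not the paper's implementation: `v` is a hypothetical value function over feature subsets (e.g. model score with the remaining features masked). Sampling the coalition size uniformly, then a coalition of that size uniformly, makes the plain average of second differences an unbiased SII estimator, since the size-uniform scheme reproduces the Shapley weights.

```python
import random


def second_difference(v, S, i, j):
    """Discrete second difference of value function v at pair (i, j) given coalition S."""
    S = frozenset(S)
    return v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)


def mc_sii(v, n, i, j, num_samples=2000, seed=0):
    """Monte Carlo estimate of the Shapley Interaction Index for feature pair (i, j).

    v : callable mapping a frozenset of feature indices to a scalar score (assumed).
    n : total number of features.
    """
    rng = random.Random(seed)
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for _ in range(num_samples):
        size = rng.randint(0, len(others))      # coalition size, uniform over 0..n-2
        S = rng.sample(others, size)            # coalition of that size, uniform
        total += second_difference(v, S, i, j)
    return total / num_samples
```

As a sanity check, a value function that pays off only when both features of a pair are present yields SII = 1 for that pair, while a purely additive value function yields SII = 0.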

Abstract

Multimodal Transformers often produce predictions without clarifying how different modalities jointly support a decision. Most existing multimodal explainable AI (MXAI) methods extend unimodal saliency to multimodal backbones, highlighting important tokens or patches within each modality, but they rarely pinpoint which cross-modal feature pairs provide complementary evidence (synergy) or serve as reliable backups (redundancy). We present Feature-level I2MoE (FL-I2MoE), a structured Mixture-of-Experts layer that operates directly on token/patch sequences from frozen pretrained encoders and explicitly separates unique, synergistic, and redundant evidence at the feature level. We further develop an expert-wise explanation pipeline that combines attribution with top-K% masking to assess faithfulness, and we introduce Monte Carlo interaction probes to quantify pairwise behavior: the Shapley Interaction Index (SII) to score synergistic pairs and a redundancy-gap score to capture substitutable (redundant) pairs. Across three benchmarks (MMIMDb, ENRICO, and MMHS150K), FL-I2MoE yields more interaction-specific and concentrated importance patterns than a dense Transformer with the same encoders. Finally, pair-level masking shows that removing pairs ranked by SII or redundancy-gap degrades performance more than masking randomly chosen pairs under the same budget, supporting that the identified interactions are causally relevant.
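The top-K% masking test described in the abstract can be sketched as follows. This is an illustrative outline, not the paper's pipeline: `predict` is a hypothetical scalar scoring function and `attribution` a per-feature importance vector. The check masks the top-K% most-attributed features and compares the score drop against masking a random set of the same size; a consistently larger drop for the attributed set is the faithfulness signal.

```python
import numpy as np


def topk_faithfulness(x, attribution, predict, k_percent=10, mask_value=0.0, seed=0):
    """Return (drop_topk, drop_random) for masking the top-K% attributed features.

    x           : 1-D input feature vector.
    attribution : per-feature importance scores, same length as x (assumed given).
    predict     : callable mapping a feature vector to a scalar score (assumed).
    """
    x = np.asarray(x, dtype=float)
    k = max(1, int(round(len(x) * k_percent / 100)))

    # Mask the k features with the highest attribution.
    top_idx = np.argsort(-np.asarray(attribution))[:k]
    x_top = x.copy()
    x_top[top_idx] = mask_value

    # Mask k features at random under the same budget, as the baseline.
    rng = np.random.default_rng(seed)
    rand_idx = rng.choice(len(x), size=k, replace=False)
    x_rand = x.copy()
    x_rand[rand_idx] = mask_value

    base = predict(x)
    return base - predict(x_top), base - predict(x_rand)
```

The same budget-matched comparison carries over to the pair-level experiments: pairs ranked by SII or redundancy-gap replace the top-K% features, and random pairs replace the random baseline.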