Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality

arXiv cs.CV / 4/14/2026


Key Points

  • The paper addresses multimodal fake news detection (MFND) in settings where a modality such as the image is missing, and shows that per-modality verification ability is easily lost under missing modality.
  • By analyzing attention heads inside MLLMs, it finds that "modality-critical heads", through their modality specialization, carry unimodal verification ability and underpin robustness to missing modalities.
  • Building on this finding, it proposes head-wise modality specialization, which allocates heads to specific modalities and preserves their specialization via lower-bound attention constraints, together with Unimodal Knowledge Retention, which prevents drift from the knowledge learned from scarce unimodal annotations.
  • Experiments show improved robustness under missing modality while limiting the performance drop on full multimodal input.
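The lower-bound attention constraint mentioned above can be pictured as a hinge penalty that keeps each modality-critical head's attention mass on its assigned modality's tokens above a floor. This is a minimal numpy sketch, not the paper's exact formulation: the function name, the per-head modality assignment, and the hinge form are illustrative assumptions.

```python
import numpy as np

def lower_bound_attention_loss(attn, head_modality, token_modality, tau=0.3):
    """Hypothetical hinge penalty: each modality-critical head should keep
    at least tau of its attention mass on tokens of its assigned modality.

    attn:           (H, T) attention weights, each row sums to 1
    head_modality:  length-H array, modality assigned to each head (0=text, 1=image)
    token_modality: length-T array, modality of each input token (0=text, 1=image)
    """
    loss = 0.0
    for h in range(attn.shape[0]):
        # attention mass this head places on tokens of its assigned modality
        mass = attn[h, token_modality == head_modality[h]].sum()
        # penalize only when that mass falls below the lower bound tau
        loss += max(0.0, tau - mass)
    return loss / attn.shape[0]
```

In training this penalty would be added to the detection loss, so that fine-tuning cannot push an image-assigned head to attend almost exclusively to text (or vice versa), which is the failure mode the paper links to lost unimodal verification ability.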

Abstract

Multimodal fake news detection (MFND) aims to verify news credibility by jointly exploiting textual and visual evidence. However, real-world news dissemination frequently suffers from missing modality due to deleted images, corrupted screenshots, and similar issues. Thus, robust detection in this scenario requires preserving strong verification ability for each modality, which is challenging in MFND due to insufficient learning of the low-contribution modality and scarce unimodal annotations. To address this issue, we propose Head-wise Modality Specialization within Multimodal Large Language Models (MLLMs) for robust MFND under missing modality. Specifically, we first systematically study attention heads in MLLMs and their relationship with performance under missing modality, showing that modality-critical heads serve as key carriers of unimodal verification ability through their modality specialization. Based on this observation, to better preserve verification ability for the low-contribution modality, we introduce a head-wise specialization mechanism that explicitly allocates these heads to different modalities and preserves their specialization through lower-bound attention constraints. Furthermore, to better exploit scarce unimodal annotations, we propose a Unimodal Knowledge Retention strategy that prevents these heads from drifting away from the unimodal knowledge learned from limited supervision. Experiments show that our method improves robustness under missing modality while preserving performance with full multimodal input.
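The Unimodal Knowledge Retention strategy described in the abstract can be sketched as an anchor penalty that discourages the modality-critical heads from drifting away from parameters learned on the scarce unimodal annotations. This is a simplified L2-anchor illustration under assumed per-head parameter vectors; the paper's actual retention objective may differ.

```python
import numpy as np

def retention_penalty(head_params, anchor_params, critical_mask, lam=0.1):
    """Hypothetical L2 anchor: keep modality-critical heads close to the
    weights they had after training on the limited unimodal supervision.

    head_params:   (H, D) current per-head parameter vectors
    anchor_params: (H, D) per-head parameters frozen after unimodal training
    critical_mask: length-H boolean, True for modality-critical heads
    """
    diff = head_params - anchor_params
    # squared drift per head; only critical heads are constrained
    per_head = (diff ** 2).sum(axis=1)
    return lam * per_head[critical_mask].sum()
```

The design intuition follows the abstract: non-critical heads remain free to adapt during multimodal fine-tuning, while the heads carrying unimodal verification ability are pinned near the knowledge distilled from the few unimodal labels.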