Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

arXiv cs.AI / 3/30/2026


Key Points

  • The paper argues that existing multimodal misinformation verification methods fail due to “feature dilution,” where holistic fusion averages out subtle local semantic inconsistencies.
  • It proposes MaLSF (Mask-aware Local Semantic Fusion), which uses mask-label pairs as semantic anchors to actively connect image regions (“pixels”) with textual meaning (“words”).
  • MaLSF introduces a Bidirectional Cross-modal Verification (BCV) module that uses parallel query streams (Text-as-Query and Image-as-Query) to explicitly locate cross-modal conflicts.
  • It also adds a Hierarchical Semantic Aggregation (HSA) module to combine conflict signals at multiple granularities for task-specific reasoning.
  • A set of diverse parsers extracts the fine-grained mask-label pair anchors, and the method reports state-of-the-art results on DGM4 and multimodal fake news detection, supported by ablation studies and visualizations.
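To make the BCV idea concrete, here is a minimal numpy sketch of bidirectional cross-modal verification. All function names, the choice of scaled dot-product attention, and the "1 − cosine similarity" conflict score are illustrative assumptions, not the paper's actual (learned) implementation; the sketch only shows the two parallel query streams (Text-as-Query and Image-as-Query) producing per-token and per-region conflict signals.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention: each query row attends over
    the other modality's features and reads out a weighted summary."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)    # (num_q, num_kv)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ keys_values                     # (num_q, d)

def bidirectional_verification(text_tokens, region_feats):
    """Hypothetical BCV sketch: two parallel streams, Text-as-Query and
    Image-as-Query. A feature that cannot be reconstructed from the other
    modality (low cosine similarity with its cross-modal readout) gets a
    high conflict score -- a stand-in for a localized inconsistency."""
    def conflict(q, kv):
        readout = cross_attend(q, kv)
        cos = (q * readout).sum(-1) / (
            np.linalg.norm(q, axis=-1) * np.linalg.norm(readout, axis=-1) + 1e-8
        )
        return 1.0 - cos                             # in [0, 2]; higher = more conflict
    return conflict(text_tokens, region_feats), conflict(region_feats, text_tokens)
```

Because both streams are computed in parallel, a conflict can be attributed either to a text token unsupported by any image region or to a masked region unsupported by the caption, which is what lets the method "explicitly pinpoint" the inconsistency.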

Abstract

As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with such content. Due to "feature dilution," global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.
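The HSA step can be sketched as multi-granularity pooling over the per-token and per-region conflict signals. This is a hypothetical stand-in: the paper's HSA module is a learned aggregator, whereas the fixed max/mean pooling and averaging head below merely illustrate why combining a sharp local signal with the global tendency avoids the feature-dilution problem of a single global average.

```python
import numpy as np

def hierarchical_aggregate(token_conflicts, region_conflicts):
    """Illustrative multi-granularity aggregation (assumed, not the paper's
    learned HSA): keep the sharpest local conflict (max) alongside the
    global tendency (mean) for each modality, then fuse them. A single
    global mean would dilute one strong local conflict; the max preserves it."""
    feats = np.array([
        token_conflicts.max(),  token_conflicts.mean(),   # text granularities
        region_conflicts.max(), region_conflicts.mean(),  # image granularities
    ])
    # A task-specific head would map `feats` to a verdict; a plain
    # average stands in for it here.
    return float(feats.mean())
```

For example, one strongly conflicting token among many consistent ones keeps the score high through the max channel even though the mean channel stays low.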