URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces URMF (Uncertainty-aware Robust Multimodal Fusion) to improve multimodal sarcasm detection by explicitly modeling which modality (text, image, or their interaction) is reliable rather than assuming all inputs are equally trustworthy.
  • URMF injects visual evidence into text using multi-head cross-attention, then refines incongruity-aware reasoning with multi-head self-attention over the fused semantic space.
  • It uses aleatoric uncertainty modeling by representing each modality (and interaction-aware latent states) as a learnable Gaussian posterior, and dynamically suppresses unreliable modalities during fusion.
  • The training strategy combines task supervision with modality-prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning.
  • Experiments on public multimodal sarcasm detection benchmarks report that URMF outperforms strong unimodal, multimodal, and MLLM-based baselines in both accuracy and robustness.
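The uncertainty-driven suppression described above can be illustrated with a small sketch: each modality's representation is mapped to a diagonal Gaussian posterior, and fusion weights modality means by their precision (inverse variance), so high-uncertainty modalities contribute less. This is a minimal NumPy reading of the idea, not the paper's implementation — the function names, the linear posterior heads, and the precision-normalized gating are all illustrative assumptions.

```python
import numpy as np

def gaussian_head(h, W_mu, W_logvar):
    """Map a modality feature h to a diagonal Gaussian posterior (mu, log_var).

    In URMF each modality (text, image, interaction-aware latent) gets such a
    learnable posterior; here the heads are plain linear maps for illustration.
    """
    return h @ W_mu, h @ W_logvar

def uncertainty_weighted_fusion(posteriors):
    """Fuse modality means, down-weighting high-variance (unreliable) ones.

    Weights are per-dimension precisions (inverse variances), normalized
    across modalities -- one plausible form of "dynamically suppressing
    unreliable modalities"; the paper's exact gating may differ.
    """
    mus = np.stack([mu for mu, _ in posteriors])                   # (M, d)
    precisions = np.stack([np.exp(-lv) for _, lv in posteriors])   # (M, d)
    weights = precisions / precisions.sum(axis=0, keepdims=True)   # sum to 1
    return (weights * mus).sum(axis=0)                             # fused (d,)

# A confident text posterior dominates an uncertain image posterior:
mu_text, lv_text = np.ones(4), np.full(4, -2.0)    # low variance
mu_img, lv_img = -np.ones(4), np.full(4, 2.0)      # high variance
fused = uncertainty_weighted_fusion([(mu_text, lv_text), (mu_img, lv_img)])
```

With these numbers the text modality receives a weight of roughly e²/(e² + e⁻²) ≈ 0.98, so the fused vector lands close to the text mean — the uncertain image evidence is effectively suppressed rather than discarded outright.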

Abstract

Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even irrelevant, causing deterministic fusion to introduce noisy evidence and weaken robust reasoning. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework that explicitly models modality reliability during interaction and fusion. URMF first employs multi-head cross-attention to inject visual evidence into textual representations, followed by multi-head self-attention in the fused semantic space to enhance incongruity-aware reasoning. It then performs unified unimodal aleatoric uncertainty modeling over text, image, and interaction-aware latent representations by parameterizing each modality as a learnable Gaussian posterior. The estimated uncertainty is further used to dynamically regulate modality contributions during fusion, suppressing unreliable modalities and yielding a more robust joint representation. In addition, we design a joint training objective integrating task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning. Experiments on public MSD benchmarks show that URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating the effectiveness of uncertainty-aware fusion for improving both accuracy and robustness.
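The joint training objective described in the abstract combines several terms. A common way to implement the modality-prior regularization piece is a closed-form KL divergence between each learned diagonal Gaussian posterior and a standard-normal prior, summed into a weighted total loss. The sketch below assumes that design; the loss weights, the choice of prior, and all function names are illustrative, and the alignment and contrastive terms are left as opaque inputs since the abstract does not specify their form.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Standard diagonal-Gaussian result: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2).
    A plausible choice for URMF's modality-prior regularization.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def joint_objective(task_loss, kl_terms, align_loss, contrast_loss,
                    lam_prior=0.1, lam_align=0.1, lam_con=0.1):
    """Weighted sum of the four objective components.

    lam_* are hypothetical trade-off weights, not values from the paper.
    """
    return (task_loss
            + lam_prior * sum(kl_terms)
            + lam_align * align_loss
            + lam_con * contrast_loss)

# A posterior matching the prior contributes zero regularization:
kl_zero = kl_to_standard_normal(np.zeros(3), np.zeros(3))
total = joint_objective(task_loss=0.7,
                        kl_terms=[kl_zero, kl_to_standard_normal(np.ones(3), np.zeros(3))],
                        align_loss=0.2, contrast_loss=0.3)
```

The KL term pulls each modality posterior toward the prior, which keeps the learned variances meaningful as uncertainty estimates rather than letting them collapse or explode during training.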