PC-MNet: Dual-Level Congruity Modeling for Multimodal Sarcasm Detection via Polarity-Modulated Attention

arXiv cs.CL / 5/5/2026


Key Points

  • PC-MNet proposes a new multimodal sarcasm detection model that targets pragmatic incongruities between literal text and nonverbal cues.
  • Instead of using similarity-based attention and uniform late fusion, it introduces a scalar congruity routing mechanism and a prior-guided contextual graph to better handle functional entanglement.
  • The model uses a two-stage asymmetric optimization with inconsistency-aware contrastive learning to form a generalized incongruity manifold and to fuse only the most discriminative evidence across multiple granularities.
  • Experiments on the MUStARD benchmark and spurious-correlation-mitigated balanced datasets show new state-of-the-art results, improving Macro-F1 by 3.14% over the strongest prior multimodal baseline.
  • The approach aims to architecturally isolate conflicts at atomic, compositional, and contextual levels to more robustly capture subtle pragmatic mismatches in human communication.
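
Because the headline result is reported in Macro-F1, it may help to recall how that metric is computed: it is the unweighted mean of the per-class F1 scores, so the sarcastic and non-sarcastic classes count equally regardless of how many examples each has. The sketch below is a minimal pure-Python illustration; the toy labels are invented for this example and are not taken from MUStARD or from the paper.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (1 = sarcastic, 0 = not sarcastic), invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # per-class F1 is 0.8 and 0.6667, so this prints 0.7333
```

On an imbalanced test set, Macro-F1 penalizes a model that does well only on the majority class, which is one reason it is commonly reported alongside accuracy.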

Abstract

Multimodal sarcasm detection, which aims to precisely identify pragmatic incongruities between literal text and nonverbal cues, has gained substantial attention in multimodal understanding. Recent advancements have predominantly relied on naïve similarity-based attention mechanisms and uniform late fusion strategies. Furthermore, given that functional entanglement restricts traditional late fusion, we incorporate a scalar congruity routing mechanism and a prior-guided contextual graph. This mechanism anchors a generalized incongruity manifold through a two-stage asymmetric optimization driven by inconsistency-aware contrastive learning, selectively fusing only the most discriminative multi-granularity evidence. Extensive experiments on the MUStARD benchmark and its spurious-correlation-mitigated balanced datasets demonstrate that our approach achieves new state-of-the-art performance, surpassing the strongest multimodal baseline by a substantial 3.14% improvement in Macro-F1. By architecturally isolating atomic, compositional, and contextual conflicts, this work provides a robust, decoupled paradigm for modeling subtle pragmatic incongruities in human communication.
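
The abstract does not give the formula behind the scalar congruity routing mechanism, so the following is only a toy sketch of one way a scalar score could steer fusion toward conflicting evidence. Everything in it is an assumption made for illustration: the function names `incongruity_gate` and `route`, the use of cosine similarity as the congruity score, the sigmoid gate, the `sharpness` parameter, and the example vectors. It is not the PC-MNet implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors, 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def incongruity_gate(text_vec, cue_vec, sharpness=4.0):
    """Hypothetical scalar gate in (0, 1): larger when the modalities disagree."""
    congruity = cosine(text_vec, cue_vec)  # lies in [-1, 1]
    return 1.0 / (1.0 + math.exp(sharpness * congruity))


def route(levels):
    """Fuse per-level evidence vectors, weighted by normalized gate values.

    `levels` maps a level name to (text_vec, cue_vec, evidence_vec).
    """
    gates = {name: incongruity_gate(t, c) for name, (t, c, _) in levels.items()}
    total = sum(gates.values())  # strictly positive, since each gate is in (0, 1)
    dim = len(next(iter(levels.values()))[2])
    fused = [0.0] * dim
    for name, (_, _, evidence) in levels.items():
        weight = gates[name] / total
        fused = [f + weight * e for f, e in zip(fused, evidence)]
    return gates, fused


# Invented toy inputs: text and cue agree at the atomic level, are unrelated
# at the compositional level, and directly conflict at the contextual level.
levels = {
    "atomic": ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0, 0.0]),
    "compositional": ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0, 0.0]),
    "contextual": ([1.0, 0.0], [-1.0, 0.0], [0.0, 0.0, 1.0]),
}
gates, fused = route(levels)
print(gates)
print(fused)
```

In this toy setup the contextual level, where text and cue point in opposite directions, receives the largest weight, which mirrors the stated goal of fusing only the most discriminative evidence rather than weighting all levels uniformly.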