"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

arXiv cs.CL / 4/8/2026


Key Points

  • The paper studies whether large vision-language models (VLMs) can understand multimodal puns, where visual and textual elements jointly signal both literal and figurative meanings.
  • It introduces a multimodal pun generation pipeline and releases the MultiPun dataset, including multiple pun types plus adversarial non-pun distractors to test robustness.
  • Evaluation shows that most existing models have difficulty correctly distinguishing true puns from closely related distractors.
  • The authors present prompt-level and model-level strategies that improve pun comprehension, achieving an average 16.5% gain in F1 scores.
  • The findings are positioned as guidance for building future VLMs with more human-like cross-modal reasoning and humor sensitivity.

Abstract

Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.
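
The summary above does not include the authors' evaluation code. As a rough illustration of how pun-versus-distractor detection might be framed as binary classification and scored with F1, here is a minimal sketch; the model wrapper, its `generate` method, the prompt wording, and the example format are all assumptions for illustration, not the paper's released protocol.

```python
# Minimal sketch: scoring pun-vs-distractor detection as binary classification.
# The model client, prompt, and data format below are illustrative assumptions,
# not the MultiPun authors' released code or evaluation setup.

from sklearn.metrics import f1_score

def ask_is_pun(model, image_path: str, caption: str) -> bool:
    """Query a hypothetical VLM wrapper: does this image-text pair form a pun?"""
    prompt = (
        "The caption may form a pun with the image, where one phrase carries both "
        "a literal and a figurative (or phonetically similar) meaning.\n"
        f"Caption: {caption}\n"
        "Answer with exactly one word: yes or no."
    )
    reply = model.generate(image=image_path, prompt=prompt)  # assumed API
    return reply.strip().lower().startswith("yes")

def evaluate(model, examples):
    """`examples` is assumed to be a list of dicts with image, caption, and is_pun keys."""
    y_true = [ex["is_pun"] for ex in examples]   # 1 for genuine puns, 0 for distractors
    y_pred = [ask_is_pun(model, ex["image"], ex["caption"]) for ex in examples]
    return f1_score(y_true, y_pred)              # F1 computed over the pun class
```

Under a setup like this, the reported 16.5% average F1 gain would correspond to the improvement after applying the paper's prompt-level and model-level strategies relative to the baseline prompting shown here.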