"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

arXiv cs.CL / 4/8/2026


Key Points

  • The paper studies whether large vision-language models (VLMs) can understand multimodal puns, where visual and textual elements jointly signal both literal and figurative meanings.
  • It introduces a multimodal pun generation pipeline and releases the MultiPun dataset, including multiple pun types plus adversarial non-pun distractors to test robustness.
  • Evaluation shows that most existing models have difficulty correctly distinguishing true puns from closely related distractors.
  • The authors present prompt-level and model-level strategies that improve pun comprehension, achieving an average 16.5% gain in F1 scores.
  • The findings are positioned as guidance for building future VLMs with more human-like cross-modal reasoning and humor sensitivity.

Abstract

Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.
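
The summary above does not include the authors' evaluation code. As a rough illustration of how pun-versus-distractor detection might be framed as binary classification and scored with F1, here is a minimal sketch; the model wrapper, its `generate` method, the prompt wording, and the example format are all assumptions for illustration, not the paper's released protocol.

```python
# Minimal sketch: scoring pun-vs-distractor detection as binary classification.
# The model client, prompt, and data format below are illustrative assumptions,
# not the MultiPun authors' released code or evaluation setup.

from sklearn.metrics import f1_score

def ask_is_pun(model, image_path: str, caption: str) -> bool:
    """Query a hypothetical VLM wrapper: does this image-text pair form a pun?"""
    prompt = (
        "The caption may form a pun with the image, where one phrase carries both "
        "a literal and a figurative (or phonetically similar) meaning.\n"
        f"Caption: {caption}\n"
        "Answer with exactly one word: yes or no."
    )
    reply = model.generate(image=image_path, prompt=prompt)  # assumed API
    return reply.strip().lower().startswith("yes")

def evaluate(model, examples):
    """`examples` is assumed to be a list of dicts with image, caption, and is_pun keys."""
    y_true = [ex["is_pun"] for ex in examples]   # 1 for genuine puns, 0 for distractors
    y_pred = [ask_is_pun(model, ex["image"], ex["caption"]) for ex in examples]
    return f1_score(y_true, y_pred)              # F1 computed over the pun class
```

Under a setup like this, the reported 16.5% average F1 gain would correspond to the improvement after applying the paper's prompt-level and model-level strategies relative to the baseline prompting shown here.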