"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?
arXiv cs.CL / 4/8/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper studies whether large vision-language models (VLMs) can understand multimodal puns, where visual and text jointly signal both literal and figurative meanings.
- It introduces a multimodal pun generation pipeline and releases the MultiPun dataset, including multiple pun types plus adversarial non-pun distractors to test robustness.
- Evaluation shows that most existing models have difficulty correctly distinguishing true puns from closely related distractors.
- The authors present prompt-level and model-level strategies that improve pun comprehension, yielding an average F1 gain of 16.5% (a minimal evaluation sketch follows this list).
- The findings are positioned as guidance for building future VLMs with more human-like cross-modal reasoning and humor sensitivity.
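The sketch below illustrates how a pun-vs-distractor evaluation of the kind summarized above could be wired up. The prompt wording, field names, and the `query_vlm` stub are illustrative assumptions rather than the paper's actual pipeline; only the binary F1 metric mirrors the reported evaluation.

```python
from sklearn.metrics import f1_score

# Hypothetical items: each pairs an image with a caption and a binary label
# (1 = genuine multimodal pun, 0 = adversarial non-pun distractor).
examples = [
    {"image": "img_001.jpg", "caption": "Time flies like an arrow.", "label": 1},
    {"image": "img_002.jpg", "caption": "A plain bagel on a plane.", "label": 1},
    {"image": "img_003.jpg", "caption": "A dog sitting on a couch.", "label": 0},
]

def build_pun_prompt(caption: str) -> str:
    """Assemble a simple prompt-level probe asking the model whether the
    image-caption pair forms a pun. Wording is illustrative, not the paper's."""
    return (
        "Look at the image and read the caption below.\n"
        f"Caption: {caption}\n"
        "Does the image-caption pair rely on a double meaning (a pun)? "
        "Answer strictly with 'yes' or 'no'."
    )

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to whichever VLM is under evaluation.
    Replace with a real client; this dummy always answers 'no'."""
    return "no"

def evaluate(items) -> float:
    """Score pun-vs-distractor detection with binary F1."""
    y_true, y_pred = [], []
    for ex in items:
        answer = query_vlm(ex["image"], build_pun_prompt(ex["caption"]))
        y_pred.append(1 if answer.strip().lower().startswith("yes") else 0)
        y_true.append(ex["label"])
    return f1_score(y_true, y_pred)

print(f"F1 on toy examples: {evaluate(examples):.2f}")
```

In practice, `query_vlm` would wrap the model's chat or generation API, and the label set could be extended beyond binary to cover the dataset's multiple pun types.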