CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

arXiv cs.CL / 4/21/2026

Key Points

The paper introduces CFMS, the first fine-grained Chinese multimodal sarcasm detection benchmark designed to overcome coarse labels and limited cultural coverage in prior datasets.
CFMS contains 2,796 high-quality image-text pairs with a triple-level annotation scheme covering sarcasm identification, target recognition, and explanation generation.
The authors show that fine-grained explanation annotations can help models generate images with more explicit sarcastic intent.
They also release a high-consistency parallel Chinese-English metaphor subset (200 entries each) and demonstrate that current models struggle with metaphoric reasoning.
To move beyond traditional retrieval-based exemplar selection, the authors propose PGDS, a reinforcement learning-augmented in-context learning method that dynamically optimizes which exemplars are selected; they report that it significantly outperforms existing baselines on key tasks.
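
The triple-level annotation scheme described above can be pictured as one record per image-text pair. The following is a minimal sketch only: the class name, field names, and the helper function are hypothetical and are not taken from the released CFMS data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SarcasmRecord:
    """One image-text pair with three annotation levels (all names hypothetical)."""
    image_path: str                    # location of the image
    text: str                          # text accompanying the image
    is_sarcastic: bool                 # level 1: sarcasm identification
    target: Optional[str] = None       # level 2: target recognition
    explanation: Optional[str] = None  # level 3: explanation generation


def annotation_depth(record: SarcasmRecord) -> int:
    """Count how many of the three annotation levels are filled in (1 to 3)."""
    return 1 + int(record.target is not None) + int(record.explanation is not None)
```

A record with all three levels filled in has depth 3, while a record that carries only the sarcasm label has depth 1.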

Abstract

Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social media. It comprises 2,796 high-quality image-text pairs and provides a triple-level annotation framework: sarcasm identification, target recognition, and explanation generation. We find that the fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent. Furthermore, we curate a high-consistency parallel Chinese-English metaphor subset (200 entries each), revealing significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, we propose a Reinforcement Learning-augmented In-Context Learning strategy (PGDS) to dynamically optimize exemplar selection. Extensive experiments demonstrate that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems, and the PGDS method significantly outperforms existing baselines on key tasks. Our data and code are available at https://anonymous.4open.science/r/CFMS-E8F9.
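
The abstract does not describe how PGDS is implemented. As a rough illustration of the general idea of learning exemplar selection from a reward signal instead of retrieving by similarity, the sketch below trains a softmax policy over a small exemplar pool with a plain REINFORCE update. It is a toy under stated assumptions, not the authors' method: the function names, the fixed per-exemplar rewards, and the hyperparameters are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_selector(rewards, steps=2000, lr=0.1, seed=0):
    """Learn selection probabilities over exemplars with REINFORCE.

    rewards[i] stands in for the downstream task score obtained when
    exemplar i is placed in the prompt (a toy assumption; a real system
    would query a model and score its answer).
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[choice]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(choice) w.r.t. the logits is one_hot(choice) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)
```

Calling `train_selector([0.2, 0.9, 0.4])` should shift most of the probability mass toward the second exemplar, since it yields the highest toy reward. A static retrieval method would instead rank exemplars once by similarity and never use task feedback.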

Related Articles

The 2026 Forbes AI 50 List

Reddit r/artificial

Add cryptographic authorization to AI agents in 5 minutes

Dev.to

Building a website with Replit and Vercel

Dev.to

Supercharging Your CI/CD: Integrating TestSprite AI Testing with GitHub Actions

Dev.to

The ULTIMATE Guide to AI Voice Cloning: RVC WebUI (Zero to Hero)

Dev.to
