CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

arXiv cs.CL / 4/21/2026


Key Points

  • The paper introduces CFMS, the first fine-grained Chinese multimodal sarcasm detection benchmark designed to overcome coarse labels and limited cultural coverage in prior datasets.
  • CFMS contains 2,796 high-quality image-text pairs with a triple-level annotation scheme covering sarcasm identification, target recognition, and explanation generation.
  • The authors show that fine-grained explanation annotations can help models generate images with more explicit sarcastic intent.
  • They also release a high-consistency parallel Chinese-English metaphor subset (200 entries each) and demonstrate that current models struggle with metaphoric reasoning.
  • To move beyond the limits of traditional retrieval-based exemplar selection, the authors propose PGDS, a reinforcement learning-augmented in-context learning method that dynamically selects exemplars; they report that it significantly outperforms existing baselines on key tasks.
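
To make the triple-level annotation scheme concrete, here is a minimal sketch of how one image-text pair might be represented. The class name, field names, and the rule that only sarcastic items need a target and explanation are illustrative assumptions, not the released CFMS schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SarcasmRecord:
    """Hypothetical record layout; names are illustrative, not the CFMS format."""
    image_path: str
    text: str
    is_sarcastic: bool                 # level 1: sarcasm identification
    target: Optional[str] = None       # level 2: target recognition
    explanation: Optional[str] = None  # level 3: explanation generation

    def is_complete(self) -> bool:
        # Assumption: non-sarcastic items need only the binary label,
        # while sarcastic items should carry all three annotation levels.
        if not self.is_sarcastic:
            return True
        return bool(self.target) and bool(self.explanation)


example = SarcasmRecord(
    image_path="images/0001.jpg",  # placeholder path
    text="Great, another Monday.",
    is_sarcastic=True,
    target="Mondays",
    explanation="The praise is insincere; the speaker dislikes Mondays.",
)
```

A check such as `example.is_complete()` could be used to filter out partially annotated items before training or evaluation.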

Abstract

Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social media. It comprises 2,796 high-quality image-text pairs and provides a triple-level annotation framework: sarcasm identification, target recognition, and explanation generation. We find that the fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent. Furthermore, we curate a high-consistency parallel Chinese-English metaphor subset (200 entries each), revealing significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, we propose a Reinforcement Learning-augmented In-Context Learning strategy (PGDS) to dynamically optimize exemplar selection. Extensive experiments demonstrate that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems, and the PGDS method significantly outperforms existing baselines on key tasks. Our data and code are available at https://anonymous.4open.science/r/CFMS-E8F9.
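
The abstract does not describe how PGDS works internally, so the following is only a generic illustration of reinforcement-learning-based exemplar selection, not the authors' method. It assumes a REINFORCE-style softmax policy over a fixed pool of candidate exemplars and a hypothetical `toy_reward` function standing in for downstream task accuracy. It also ignores the query, whereas a dynamic selector would condition on it.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_selector(num_exemplars, reward_fn, steps=5000, lr=0.1, seed=0):
    """Learn a selection distribution over exemplars with a policy-gradient update."""
    rng = random.Random(seed)
    logits = [0.0] * num_exemplars
    baseline = 0.0  # running average reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(num_exemplars), weights=probs, k=1)[0]
        reward = reward_fn(choice, rng)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(choice) w.r.t. logit j is 1[j == choice] - probs[j].
        for j in range(num_exemplars):
            grad = (1.0 if j == choice else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)


def toy_reward(choice, rng):
    # Hypothetical stand-in for "did the model answer correctly with this exemplar".
    # The success rates are made up for illustration only.
    success_rate = [0.2, 0.4, 0.8, 0.3][choice]
    return 1.0 if rng.random() < success_rate else 0.0


selection_probs = train_selector(4, toy_reward)
```

In a real system the reward would come from running the multimodal model with the chosen exemplars in its prompt, and the policy would take the query (and possibly the image) as input rather than learning a single fixed distribution.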