GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

arXiv cs.CL / 4/13/2026


Key Points

  • The paper introduces GRASP, a multimodal framework that combines visual grounding with explicit Chain-of-Thought (CoT) reasoning to tackle Multimodal Sarcasm Target Identification (MSTI), a task that goes beyond binary sarcasm detection.
  • It presents the MSTI-MAX dataset, designed to mitigate class imbalance and enrich multimodal sarcasm cues for fine-grained localization of textual phrases and visual regions.
  • GRASP uses “Grounded CoT” to anchor sarcasm-relevant visual regions within the reasoning process and requires the model to articulate rationales prior to final label and target predictions.
  • The method applies a dual-stage, outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization.
  • Experiments report that GRASP outperforms existing baselines in fine-grained target identification across modalities, and an LLM-as-a-Judge evaluation measures the quality of the internal reasoning chains. The dataset and source code are to be released on GitHub.
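
To make the idea of outcome supervision over a label and a visual target more concrete, here is a minimal sketch of how such a reward could be scored. The summary does not specify the actual reward, so everything here is an assumption for illustration: the `(x1, y1, x2, y2)` box convention, the use of intersection-over-union (IoU) as the localization score, the function names `iou` and `outcome_reward`, and the `label_weight` mixing factor.

```python
from typing import Tuple

# Hypothetical box convention: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def outcome_reward(
    pred_sarcastic: bool,
    gold_sarcastic: bool,
    pred_box: Box,
    gold_box: Box,
    label_weight: float = 0.5,  # illustrative value, not taken from the paper
) -> float:
    """Blend label correctness with box overlap into one scalar in [0, 1]."""
    label_score = 1.0 if pred_sarcastic == gold_sarcastic else 0.0
    box_score = iou(pred_box, gold_box)
    return label_weight * label_score + (1.0 - label_weight) * box_score
```

Under these assumptions, a prediction with the right label and a perfectly matching box scores 1.0, while a wrong label with a perfect box scores 0.5. A real setup would also need a score for textual targets, which this sketch leaves out.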

Abstract

Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.
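
The abstract mentions a coordinate-aware weighted loss for the Supervised Fine-Tuning stage without giving its form. One plausible reading, sketched below strictly as an assumption, is a token-level negative log-likelihood in which tokens that encode bounding-box coordinates receive a larger weight than ordinary text tokens. The function name `weighted_token_nll`, the `is_coordinate` mask, and the default `coord_weight` are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def weighted_token_nll(
    token_log_probs: Sequence[float],
    is_coordinate: Sequence[bool],
    coord_weight: float = 2.0,  # hypothetical up-weighting factor
) -> float:
    """Weighted mean negative log-likelihood over reference tokens.

    token_log_probs[i] is the model's log-probability of the i-th reference
    token; is_coordinate[i] marks tokens that are part of a box coordinate.
    """
    if len(token_log_probs) != len(is_coordinate):
        raise ValueError("inputs must have the same length")
    if not token_log_probs:
        raise ValueError("need at least one token")
    weights = [coord_weight if c else 1.0 for c in is_coordinate]
    total = sum(-lp * w for lp, w in zip(token_log_probs, weights))
    # Normalize by the weight sum so the loss scale stays comparable
    # across sequences with different numbers of coordinate tokens.
    return total / sum(weights)


if __name__ == "__main__":
    log_probs = [math.log(0.5), math.log(0.25)]
    print(weighted_token_nll(log_probs, [False, True]))
```

With `coord_weight=1.0` this reduces to the ordinary mean token loss, so the weight controls only how strongly errors on coordinate tokens are emphasized relative to the rest of the output.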