GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification
arXiv cs.CL / 4/13/2026
Key Points
- The paper introduces GRASP, a multimodal framework that combines visual grounding with explicit Chain-of-Thought (CoT) reasoning to push Multimodal Sarcasm Target Identification (MSTI) beyond binary sarcasm detection.
- It presents the MSTI-MAX dataset, designed to mitigate class imbalance and enrich multimodal sarcasm cues for fine-grained localization of textual phrases and visual regions.
- GRASP uses “Grounded CoT” to anchor sarcasm-relevant visual regions within the reasoning process and requires the model to articulate rationales prior to final label and target predictions.
- The method applies a dual-stage, outcome-supervised joint optimization strategy, starting with coordinate-aware supervised fine-tuning and then performing fine-grained target policy optimization.
- Experiments report improved fine-grained target identification across both modalities; an LLM-as-a-Judge evaluation assesses the quality of the internal reasoning chains, and the dataset and source code are slated for release on GitHub.
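The summary does not give the paper's reward formulation, but an outcome-supervised objective over a visual region plus a textual target phrase can be sketched abstractly. The function names and the equal-weight reward shape below are illustrative assumptions, not the authors' implementation: the sketch scores a predicted bounding box by IoU against the gold box and a predicted target phrase by token-level F1, then combines them into one scalar that a policy-optimization stage could maximize.

```python
from collections import Counter

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def span_f1(pred_tokens, gold_tokens):
    """Token-level F1 between predicted and gold target phrases."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if not overlap:
        return 0.0
    p = overlap / len(pred_tokens)
    r = overlap / len(gold_tokens)
    return 2 * p * r / (p + r)

def outcome_reward(pred_box, gold_box, pred_phrase, gold_phrase):
    """Hypothetical scalar reward: equal-weight mix of visual and
    textual target quality, as an outcome-only training signal."""
    return 0.5 * box_iou(pred_box, gold_box) + 0.5 * span_f1(
        pred_phrase.split(), gold_phrase.split()
    )
```

A perfect prediction on both modalities yields a reward of 1.0, while a miss on either side degrades the score smoothly, which is the kind of dense outcome signal a fine-grained target policy-optimization stage would need.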