GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

arXiv cs.CL / 4/13/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper introduces GRASP, a multimodal framework that combines grounded visual grounding with explicit Chain-of-Thought (CoT) reasoning to improve Multimodal Sarcasm Target Identification (MSTI) beyond binary sarcasm detection.
It presents the MSTI-MAX dataset, designed to mitigate class imbalance and enrich multimodal sarcasm cues for fine-grained localization of textual phrases and visual regions.
GRASP uses “Grounded CoT” to anchor sarcasm-relevant visual regions within the reasoning process and requires the model to articulate rationales prior to final label and target predictions.
The method applies a dual-stage, outcome-supervised joint optimization strategy, starting with coordinate-aware supervised fine-tuning and then performing fine-grained target policy optimization.
Experiments report improved fine-grained target identification across modalities, with an LLM-as-a-Judge evaluation assessing the quality of internal reasoning chains, and the dataset/source code planned for GitHub release.

Abstract

Moving beyond the traditional binary classification paradigm of Multimodal Sarcasm Detection, Multimodal Sarcasm Target Identification (MSTI) presents a more formidable challenge, requiring precise localization of fine-grained targets such as textual phrases and visual regions. Existing approaches predominantly rely on implicit cross-modal alignment, offering limited interpretability and suboptimal fine-grained localization. To address these limitations, we propose GRASP, Grounded Chain-of-Thought ReAsoning with Dual-Stage Optimization for Multimodal Sarcasm Prediction and Target Identification, a framework that integrates visual grounding with explicit Chain-of-Thought (CoT) reasoning to move beyond black-box MSTI. Specifically, we curate MSTI-MAX, a refined dataset that mitigates class imbalance and enriches multimodal sarcasm cues. We introduce Grounded CoT reasoning, which explicitly anchors sarcasm-related visual regions within the reasoning trajectory and prompts the model to articulate rationales before predicting the final classification labels and sarcasm targets. Furthermore, we employ a dual-stage outcome-supervised joint optimization strategy: Supervised Fine-Tuning with a coordinate-aware weighted loss, followed by Fine-Grained Target Policy Optimization. Extensive experiments demonstrate that GRASP outperforms existing baselines in fine-grained sarcasm target identification across modalities, and an LLM-as-a-Judge evaluation quantitatively measures the quality of internal reasoning chains. Our dataset and source code will be released on GitHub.

Black Hat Asia

AI Business

Apple is building smart glasses without a display to serve as an AI wearable

THE DECODER

Why Fashion Trend Prediction Isn’t Enough Without Generative AI

Dev.to

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

Chatbot vs Voicebot: The Real Business Decision Nobody Talks About

Dev.to

GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

Key Points

Abstract

Related Articles

Black Hat Asia

Apple is building smart glasses without a display to serve as an AI wearable

Why Fashion Trend Prediction Isn’t Enough Without Generative AI

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Chatbot vs Voicebot: The Real Business Decision Nobody Talks About

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer