DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding

arXiv cs.CL / April 28, 2026


Key Points

  • DARC-CLIP is a CLIP-based multimodal framework designed to better understand memes by capturing fine-grained, bidirectional dependencies between visual and textual signals.
  • It replaces static multimodal fusion with a hierarchical refinement stack that uses Adaptive Cross-Attention Refiners for dynamic alignment and Dynamic Feature Adapters for task-sensitive signal adaptation (see the sketches after this list and after the abstract).
  • The model is evaluated on the PrideMM benchmark for hate, target, stance, and humor classification, and also tested for generalization on the CrisisHateMM dataset.
  • DARC-CLIP delivers strong results, including sizable improvements in hate detection (+4.18 AUROC and +6.84 F1) over the best baseline.
  • Ablation experiments indicate that the Adaptive Cross-Attention Refiners (ACAR) and Dynamic Feature Adapters (DFA) are the main drivers of the performance gains.
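
The summary does not include code, but the core ACAR idea, bidirectional cross-attention in which image tokens attend to text tokens and vice versa, can be sketched in PyTorch. This is a minimal illustration only: the class name, the zero-initialized tanh mixing gates, and the dimensions are assumptions, not the authors' implementation.

```python
# Minimal sketch of one bidirectional cross-attention refiner layer.
# Assumes CLIP-style token sequences; gating scheme and sizes are guesses.
import torch
import torch.nn as nn

class AdaptiveCrossAttentionRefiner(nn.Module):
    """Refines image and text features by letting each attend to the other."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # img2txt: image queries over text keys/values, and vice versa.
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)
        # Zero-initialized learned gates start each layer as an identity map
        # and let it adapt how much cross-modal signal to mix in (our reading
        # of "adaptive", not a confirmed detail of the paper).
        self.gate_img = nn.Parameter(torch.zeros(1))
        self.gate_txt = nn.Parameter(torch.zeros(1))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # img: (B, N_img, dim) patch tokens; txt: (B, N_txt, dim) word tokens
        img_upd, _ = self.img2txt(self.norm_img(img), txt, txt)
        txt_upd, _ = self.txt2img(self.norm_txt(txt), img, img)
        img = img + torch.tanh(self.gate_img) * img_upd
        txt = txt + torch.tanh(self.gate_txt) * txt_upd
        return img, txt

refiner = AdaptiveCrossAttentionRefiner()
img = torch.randn(2, 50, 512)   # e.g. ViT patch tokens
txt = torch.randn(2, 77, 512)   # e.g. CLIP text tokens
img, txt = refiner(img, txt)
```

Stacking several such layers gives the hierarchical refinement described above, with each level re-aligning the two modalities on top of the previous one.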

Abstract

Memes convey meaning through the interaction of visual and textual signals, often combining humor, irony, and offense in subtle ways. Detecting harmful or sensitive content in memes requires accurate modeling of these multimodal cues. Existing CLIP-based approaches rely on static fusion, which struggles to capture fine-grained dependencies between modalities. We propose DARC-CLIP, a CLIP-based framework for adaptive multimodal fusion with a hierarchical refinement stack. DARC-CLIP introduces Adaptive Cross-Attention Refiners (ACAR) for bidirectional information alignment and Dynamic Feature Adapters (DFA) for task-sensitive signal adaptation. We evaluate DARC-CLIP on the PrideMM benchmark, which includes hate, target, stance, and humor classification, and further test generalization on the CrisisHateMM dataset. DARC-CLIP achieves highly competitive classification accuracy across tasks, with significant gains of +4.18 AUROC and +6.84 F1 in hate detection over the strongest baseline. Ablation studies confirm that ACAR and DFA are the main contributors to these gains. These results show that adaptive cross-signal refinement is an effective strategy for multimodal content analysis in socially sensitive classification tasks.
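
The DFA component can plausibly be read as a gated bottleneck adapter applied per task over the fused representation. The sketch below is an assumption in that spirit: the bottleneck width, the feature-conditioned gate, the pooling, and the class counts for the four PrideMM tasks are all placeholders, not details confirmed by the paper.

```python
# Minimal sketch of a Dynamic Feature Adapter with per-task heads.
# The gated-bottleneck reading of "dynamic, task-sensitive adaptation"
# and all sizes/class counts are illustrative assumptions.
import torch
import torch.nn as nn

class DynamicFeatureAdapter(nn.Module):
    """Residual bottleneck adapter with a feature-conditioned gate."""

    def __init__(self, dim: int = 512, bottleneck: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        # The gate modulates, per example, how much adapted signal is
        # mixed back into the fused CLIP features.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        adapted = self.up(self.act(self.down(x)))
        return x + self.gate(x) * adapted

# Illustrative per-task wiring: each PrideMM task gets its own adapter and
# linear head over a shared refined representation (class counts are guesses).
dim, tasks = 512, {"hate": 2, "target": 4, "stance": 3, "humor": 2}
adapters = nn.ModuleDict({t: DynamicFeatureAdapter(dim) for t in tasks})
heads = nn.ModuleDict({t: nn.Linear(dim, n) for t, n in tasks.items()})

fused = torch.randn(8, dim)  # stand-in for pooled, ACAR-refined CLIP features
logits = {t: heads[t](adapters[t](fused)) for t in tasks}
```

Keeping the adapters task-specific while sharing the refinement stack is one plausible way to reconcile the multi-task PrideMM setup with the single fused representation the abstract describes.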