MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes
arXiv cs.CL / April 17, 2026
📰 News · Tools & Practical Usage · Models & Research
Key Points
- The paper introduces a system for the CHiPSAL 2026 shared task to detect hate speech and classify sentiment in Nepali memes written in the Devanagari script.
- It uses a hybrid cross-modal attention fusion approach: CLIP (visual) and BGE-M3 (multilingual text) embeddings are combined via 4-head self-attention, with a learnable gating network providing per-sample modality weighting.
- Experiments across eight configurations show that explicit cross-modal reasoning improves macro-F1 by 5.9% over a text-only baseline on Subtask A (binary hate detection).
- The study surfaces two pitfalls of this low-resource, script-specific setting: vision models trained with an English-centric focus perform near-randomly on Devanagari text, and common ensemble methods can fail catastrophically when data is scarce because their members overfit in correlated ways.
- The authors provide code for the proposed approach on GitHub for reproducibility and further research.
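The fusion recipe in the key points (multi-head self-attention over both modalities, then a learned gate weighting them per sample) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions, sequence lengths, and random projection weights are all placeholder assumptions, and a real system would use trained parameters and actual CLIP / BGE-M3 features.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 4          # shared embedding dim (assumed) and 4 attention heads (from the paper)
d_h = D // H
P, T = 9, 12          # image patches and text tokens (illustrative sizes)

# Stand-ins for per-meme features: CLIP patch embeddings and BGE-M3 token
# embeddings, assumed already projected into a shared D-dim space.
img_patches = rng.standard_normal((P, D))
txt_tokens = rng.standard_normal((T, D))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(seq, Wq, Wk, Wv, Wo):
    """Scaled dot-product self-attention over `seq`, split into H heads."""
    Q, K, V = seq @ Wq, seq @ Wk, seq @ Wv
    split = lambda X: X.reshape(X.shape[0], H, d_h).transpose(1, 0, 2)  # (H, N, d_h)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    attn = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h))           # (H, N, N)
    out = (attn @ Vh).transpose(1, 0, 2).reshape(seq.shape[0], D)
    return out @ Wo

def proj():
    # Random stand-ins for learned projection matrices.
    return rng.standard_normal((D, D)) / np.sqrt(D)

# Self-attention over the concatenated modalities lets every text token
# attend to every image patch and vice versa (cross-modal reasoning).
seq = np.concatenate([img_patches, txt_tokens], axis=0)                 # (P+T, D)
attended = multi_head_self_attention(seq, proj(), proj(), proj(), proj())

# Pool each modality's attended segment back to one vector.
v_vec = attended[:P].mean(axis=0)
t_vec = attended[P:].mean(axis=0)

# Learnable gate: a per-sample scalar in (0, 1) weighting text vs. vision.
w_g = rng.standard_normal(2 * D) / np.sqrt(2 * D)
g = 1.0 / (1.0 + np.exp(-np.concatenate([t_vec, v_vec]) @ w_g))         # sigmoid
fused = g * t_vec + (1.0 - g) * v_vec                                   # (D,) fused representation
```

The fused vector would then feed a classification head for hate detection or sentiment. The gating step is what makes the modality weighting per-sample: a text-dominant meme can push `g` toward 1, an image-dominant one toward 0.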