Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media
arXiv cs.CL / 3/20/2026
Key Points
- We propose an interpretable-by-design multimodal classification framework that jointly learns text and image representations with a vision-language transformer and extracts text rationales to explain its predictions.
- The method introduces cross-modal rationale transfer: image rationales are learned by mapping from text rationales, reducing annotation effort (see the sketch after this list).
- On CrisisMMD, it improves Macro-F1 by 2-35% and reaches 80% accuracy in the zero-shot setting, while producing text rationales and image patches as explanations.
- Human evaluation reports improvements of about 12% in the quality of retrieved image rationale patches, which help identify humanitarian categories.
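
To make the idea concrete, below is a minimal PyTorch sketch of how text rationales could be transferred onto image patches. It is an illustrative reconstruction under stated assumptions, not the paper's implementation: the encoders, dimensions, and the cross-attention mapping from rationale tokens to patches are placeholders, and in practice the text and image features would come from a pretrained vision-language transformer.

```python
# Minimal sketch of a cross-modal rationale-transfer classifier.
# All module choices and sizes here are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossModalRationaleClassifier(nn.Module):
    def __init__(self, d_model=256, num_classes=8, vocab_size=30522):
        super().__init__()
        # Stand-in encoders; a pretrained vision-language transformer would be used in practice.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.patch_proj = nn.Linear(768, d_model)  # raw image patch features -> d_model

        # Token-level text rationale scores (trained against text rationale labels).
        self.text_rationale_head = nn.Linear(d_model, 1)
        # Cross-attention that transfers text rationales onto image patches,
        # so patch-level rationales need no extra annotation.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.patch_rationale_head = nn.Linear(d_model, 1)

        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, token_ids, patch_feats):
        text = self.text_encoder(self.text_embed(token_ids))         # (B, T, d)
        patches = self.patch_proj(patch_feats)                       # (B, P, d)

        # Text rationale mask: which tokens explain the prediction.
        text_scores = torch.sigmoid(self.text_rationale_head(text))  # (B, T, 1)
        rationale_text = text * text_scores                          # soft-masked tokens

        # Transfer step: image patches attend to rationale tokens; attended
        # patches inherit rationale signal and are scored as image rationales.
        attended, _ = self.cross_attn(patches, rationale_text, rationale_text)
        patch_scores = torch.sigmoid(self.patch_rationale_head(attended))  # (B, P, 1)

        # Classify from rationale-weighted pooled representations of both modalities.
        text_vec = rationale_text.mean(dim=1)
        img_vec = (patches * patch_scores).mean(dim=1)
        logits = self.classifier(torch.cat([text_vec, img_vec], dim=-1))
        return logits, text_scores.squeeze(-1), patch_scores.squeeze(-1)


if __name__ == "__main__":
    model = CrossModalRationaleClassifier()
    tokens = torch.randint(0, 30522, (2, 32))   # toy tweet token ids
    patches = torch.randn(2, 49, 768)           # toy ViT-style patch features
    logits, text_r, patch_r = model(tokens, patches)
    print(logits.shape, text_r.shape, patch_r.shape)  # (2, 8) (2, 32) (2, 49)
```

In a sketch like this, the rationale scores double as the explanations: top-scoring tokens are reported as text rationales and top-scoring patches as image rationales.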