AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

arXiv cs.CV · April 15, 2026

Key Points

  • AffectAgent is a multi-agent retrieval-augmented multimodal emotion recognition framework designed to reduce hallucinations and better capture nuanced affective states across modalities.
  • The system uses three specialized, jointly optimized agents—a query planner, an evidence filter, and an emotion generator—to retrieve cross-modal evidence, assess it, and produce emotion predictions.
  • AffectAgent is end-to-end trained with Multi-Agent Proximal Policy Optimization (MAPPO) using a shared affective reward to align the agents’ collaborative reasoning.
  • It introduces Modality-Balancing Mixture of Experts (MB-MoE) to dynamically weight modalities and mitigate cross-modal representation mismatches, and Retrieval-Augmented Adaptive Fusion (RAAF) to improve predictions when a modality is missing.
  • Experiments on MER-UniBench report that AffectAgent achieves stronger performance than prior approaches, and the authors plan to release the code publicly.
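To make the MB-MoE idea above concrete, here is a minimal sketch of gated fusion over per-modality expert outputs. All names (`mb_moe_fuse`, the `temperature` knob) are illustrative assumptions, not the paper's actual implementation; the point is only that a learned gate produces per-modality weights so that a dominant modality can be re-balanced.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mb_moe_fuse(modality_feats, gate_logits, temperature=1.0):
    """Fuse per-modality expert outputs with a learned gate.

    modality_feats: dict name -> (d,) feature vector (expert output)
    gate_logits:    dict name -> scalar logit from a gating network
    Returns the gated sum and the per-modality weights; raising the
    temperature flattens the gate, down-weighting a dominant modality.
    """
    names = sorted(modality_feats)
    logits = np.array([gate_logits[n] for n in names]) / temperature
    weights = softmax(logits)
    fused = sum(w * modality_feats[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, weights))

# Illustrative use: three modality features with text currently dominant.
feats = {"text": np.ones(4), "audio": np.zeros(4), "video": np.full(4, 2.0)}
logits = {"text": 2.0, "audio": 0.0, "video": 1.0}
fused, weights = mb_moe_fuse(feats, logits)
```

The actual MB-MoE gate is learned end-to-end alongside the agents; this sketch only shows the inference-time weighting it would produce.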

Abstract

LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. Because single-round retrieval-augmented generation is highly susceptible to modal ambiguity and struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end with Multi-Agent Proximal Policy Optimization (MAPPO) under a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF): MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.
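The abstract's RAAF component completes a missing modality with retrieved embeddings. A minimal sketch of that idea, assuming a nearest-neighbor retrieval bank (the names `raaf_complete`, `memory`, and `query_key` are hypothetical, not the paper's API):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon for numerical safety.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def raaf_complete(feats, memory, missing, query_key="text", top_k=2):
    """If the `missing` modality is absent from `feats`, substitute the
    mean of the top-k retrieved embeddings of that modality, ranked by
    similarity of the available `query_key` features.

    feats:  dict modality -> (d,) array for the current sample
    memory: list of dicts with the same modality keys (retrieval bank)
    """
    if missing in feats:
        return feats  # nothing to complete
    scored = sorted(memory,
                    key=lambda m: cosine(m[query_key], feats[query_key]),
                    reverse=True)[:top_k]
    proxy = np.mean([m[missing] for m in scored], axis=0)
    return {**feats, missing: proxy}
```

In the paper, the retrieved audiovisual embeddings are fused adaptively rather than simply averaged; this sketch only illustrates the completion step under a missing-modality condition.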