A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation

arXiv cs.CV / 4/22/2026

📰 NewsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper addresses multimodal empathetic response generation (MERG), arguing that common one-pass methods miss the structured nature of human emotion perception and can introduce emotional biases.
  • It proposes a multi-agent framework that performs structured reasoning via decomposed modules, including emotion forecasting, pragmatic strategy planning, and strategy-guided response generation from multimodal inputs.
  • It adds a global reflection and refinement loop where a reflection agent audits intermediate states and the draft response step-by-step to remove empathy errors and trigger targeted regeneration.
  • Experiments on benchmarks such as IEMOCAP and MELD show improved empathic response generation compared with prior state-of-the-art approaches.
  • The overall contribution is a closed-loop process that iteratively improves emotion perception accuracy and reduces emotion biases during generation.

Abstract

Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users' multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation via multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state-of-the-art methods.