Dynamic Fusion-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition in Conversations

arXiv cs.AI / 3/25/2026


Key Points

  • The paper introduces DF-GCN, a dynamic fusion-aware graph convolutional neural network designed for multimodal emotion recognition in conversations (MERC) using modalities such as text, audio, and images.
  • It addresses a key limitation of prior GCN-based MERC approaches by avoiding fixed multimodal fusion parameters across emotion categories, which can force trade-offs in per-emotion performance.
  • DF-GCN integrates ordinary differential equations into GCNs to model the dynamic evolution of emotional dependencies across an utterance interaction graph.
  • It uses prompts derived from a Global Information Vector (GIV) to guide how multimodal features are dynamically fused, enabling the model to adjust parameters per utterance during inference.
  • Experiments on two public multimodal conversational datasets indicate improved performance, attributed to the dynamic fusion mechanism and enhanced generalization.
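The ODE-integrated graph convolution in the first point can be pictured as letting utterance node states evolve continuously over the interaction graph. Below is a minimal, hypothetical sketch in plain Python, assuming an Euler solver and the dynamics dH/dt = Â·H; the function names and solver choice are illustrative assumptions, not the authors' implementation.

```python
def matmul(A, B):
    """Plain-Python matrix multiply for small dense matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def ode_gcn(A_hat, H0, steps=10, dt=0.1):
    """Euler-integrate dH/dt = A_hat @ H from initial features H0.

    A_hat: normalized adjacency of the utterance interaction graph
           (assumed precomputed).
    H0:    initial utterance features, one row per utterance.
    """
    H = [row[:] for row in H0]
    for _ in range(steps):
        dH = matmul(A_hat, H)  # continuous message passing step
        H = [[h + dt * d for h, d in zip(hr, dr)]
             for hr, dr in zip(H, dH)]
    return H
```

Integrating the dynamics instead of stacking discrete GCN layers is what lets the model treat emotional dependencies as evolving over the conversation rather than as a fixed number of hops.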

Abstract

Multimodal emotion recognition in conversations (MERC) aims to identify and understand the emotions expressed by speakers during utterance interaction from multiple modalities (e.g., text, audio, and images). Existing studies have shown that GCNs can improve MERC performance by modeling dependencies between speakers. However, existing methods usually apply fixed parameters when processing multimodal features for different emotion types, ignoring the dynamics of fusion between modalities; this forces the model to balance performance across emotion categories and limits its performance on some specific emotions. To this end, we propose a dynamic fusion-aware graph convolutional neural network (DF-GCN) for robust recognition of multimodal emotion features in conversations. Specifically, DF-GCN integrates ordinary differential equations into graph convolutional networks (GCNs) to capture the dynamic nature of emotional dependencies within utterance interaction networks, and it leverages prompts generated from the global information vector (GIV) of each utterance to guide the dynamic fusion of multimodal features. This allows the model to change its parameters dynamically when processing each utterance, so that different emotion categories can be handled with different network parameters at inference, yielding more flexible emotion classification and stronger generalization. Comprehensive experiments on two public multimodal conversational datasets confirm that the proposed DF-GCN delivers superior performance, benefiting significantly from the dynamic fusion mechanism introduced.
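The prompt-guided dynamic fusion described above can be sketched as per-utterance fusion weights conditioned on a global information vector. The following plain-Python sketch is an assumption about the general shape of such a mechanism (gating scores from the GIV, softmax-normalized, then a weighted sum of modality features); the names `giv` and `gate_weights` are hypothetical, and the paper's actual prompt mechanism may differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dynamic_fusion(modal_feats, giv, gate_weights):
    """Fuse modality features with weights conditioned on the GIV.

    modal_feats:  feature vectors for one utterance, one per modality
                  (e.g. [text, audio, visual]), all the same length.
    giv:          global information vector acting as the prompt.
    gate_weights: one learned score vector per modality; its dot
                  product with the GIV scores that modality here.
    """
    scores = [sum(g * w for g, w in zip(giv, wv)) for wv in gate_weights]
    alphas = softmax(scores)  # per-utterance fusion coefficients
    dim = len(modal_feats[0])
    return [sum(a * feat[i] for a, feat in zip(alphas, modal_feats))
            for i in range(dim)]
```

Because the coefficients are recomputed from the GIV of each utterance, the effective fusion parameters differ per utterance at inference time, which is the flexibility the abstract attributes to DF-GCN.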