Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition
arXiv cs.CL / 3/30/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper targets multimodal conversation emotion recognition (MCER) where audio/video features are degraded by environmental noise and uneven modality quality can bias fusion results.
- It introduces a relation-aware denoising and diffusion attention fusion model that uses a differential Transformer, which subtracts paired attention maps to cancel common attention noise, yielding temporally consistent signal enhancement and noise suppression in both the audio and video streams.
- It builds modality-specific and cross-modality relation subgraphs to model speaker-dependent emotional dependencies across intra- and inter-modal interactions.
- It proposes a text-guided cross-modal diffusion mechanism that adaptively diffuses audiovisual information into the textual stream via self-attention, aiming for more robust and semantically aligned fusion.
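To make the differential-attention idea concrete, here is a minimal single-head NumPy sketch. It follows the general differential Transformer recipe (the difference of two softmax attention maps, with the second acting as a noise estimate); the weight shapes, the fixed `lam` scalar, and the single-head form are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Difference of two softmax attention maps over the same sequence.

    The second map serves as an estimate of attention noise and is
    subtracted (scaled by lam) from the first, so attention mass that
    both maps place on noisy frames cancels out. In the paper's setting,
    x would be a sequence of audio or video frame features.
    """
    d = Wq1.shape[1]
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))  # primary attention map
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))  # noise-estimate map
    return (a1 - lam * a2) @ (x @ Wv)                   # denoised aggregation

# Hypothetical usage on random "frame features"
rng = np.random.default_rng(0)
T, d_in, d = 6, 8, 4
frames = rng.standard_normal((T, d_in))
W = [rng.standard_normal((d_in, d)) for _ in range(5)]
out = differential_attention(frames, *W)   # shape (T, d)
```

With `lam=0` this reduces to plain scaled dot-product attention; in a trained model `lam` would be learned rather than fixed.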
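The text-guided fusion step can likewise be sketched as cross-attention in which text tokens query the audiovisual features, and the resulting weights gate how much audiovisual evidence flows into the textual stream. This is a generic single-head sketch under assumed shapes, not the paper's exact mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_fusion(text, av, Wq, Wk, Wv):
    """Text tokens attend over audiovisual features.

    text: (T_text, d) textual token features (the guiding stream)
    av:   (T_av, d)   concatenated audio/video frame features
    The attention weights decide, per text token, how much audiovisual
    information diffuses into it; a residual keeps the text semantics.
    """
    dk = Wq.shape[1]
    attn = softmax((text @ Wq) @ (av @ Wk).T / np.sqrt(dk))  # (T_text, T_av)
    return text + attn @ (av @ Wv)                           # fused text stream

# Hypothetical usage with random features of a shared width d
rng = np.random.default_rng(1)
T_text, T_av, d = 5, 9, 8
text = rng.standard_normal((T_text, d))
av = rng.standard_normal((T_av, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
fused = text_guided_fusion(text, av, Wq, Wk, Wv)  # shape (T_text, d)
```

Because the queries come from the text side, uneven audio/video quality degrades the attention weights rather than the fused representation's semantic anchor, which is the intuition behind text-guided fusion.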