Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling

arXiv cs.CV / 4/15/2026


Key Points

  • The paper proposes a cognition-inspired dual-stream model (DuSE) to improve vision-based dynamic emotion modeling by explicitly incorporating brain-inspired semantic and contextual processing mechanisms.
  • DuSE uses two components: a Hierarchical Temporal Prompt Cluster (HTPC) to simulate cognitive priming that aligns linguistic semantics with temporal facial dynamics, and a Latent Semantic Emotion Aggregator (LSEA) to integrate sensory cues with learned conceptual knowledge.
  • The method is designed to enhance dynamic facial expression recognition (DFER) by addressing limitations in existing approaches that often ignore emotion perception and cognitive theories.
  • Experiments on challenging in-the-wild benchmarks are reported to validate DuSE, showing state-of-the-art performance and improved interpretability relative to prior methods.

Abstract

The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain's strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.
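The dual-stream design described in the abstract can be caricatured in a short, self-contained sketch: one stream weights incoming frame features by their similarity to a text prompt embedding (a stand-in for HTPC-style priming), and a second stream pools the primed features and blends them with a learned conceptual prototype (a stand-in for LSEA-style knowledge integration). Everything below — the function names, the cosine-weighted priming, the fixed-ratio blend, and the toy embeddings — is an illustrative assumption, not the paper's actual implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prime_frames(frame_feats, prompt_emb):
    # HTPC-like priming (hypothetical): weight each frame feature by its
    # similarity to the linguistic prompt, then normalize the weights.
    weights = [max(cosine(f, prompt_emb), 0.0) for f in frame_feats]
    total = sum(weights) or 1.0
    return [w / total for w in weights]

def aggregate(frame_feats, weights, concept_emb, alpha=0.5):
    # LSEA-like aggregation (hypothetical): pool the primed sensory features,
    # then blend the pooled vector with a learned conceptual prototype.
    dim = len(concept_emb)
    pooled = [sum(w * f[i] for w, f in zip(weights, frame_feats))
              for i in range(dim)]
    return [(1 - alpha) * p + alpha * c for p, c in zip(pooled, concept_emb)]

# Toy run: three frames with 2-D features; embeddings are made up.
frames = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
prompt = [1.0, 0.0]    # hypothetical prompt embedding (e.g. for "happiness")
concept = [0.5, 0.5]   # hypothetical learned conceptual prototype
w = prime_frames(frames, prompt)
rep = aggregate(frames, w, concept)
```

Frames most aligned with the prompt dominate the pooled representation, mirroring the idea that linguistic cues pre-sensitize the processing of visual input before conceptual knowledge is folded in. In the real model both streams would operate on learned deep features and attention, not fixed cosine weights.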