ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception

arXiv cs.CV · April 15, 2026


Key Points

  • The paper introduces ARGen, a two-stage framework for improving dynamic facial expression/emotion recognition in unconstrained (“in the wild”) settings where data is scarce and emotions follow long-tail distributions.
  • ARGen's Affective Semantic Injection (ASI) stage aligns affective knowledge through facial Action Units and uses retrieval-augmented prompt generation with large vision-language models to produce interpretable emotional priors.
  • It then applies Adaptive Reinforcement Diffusion (ARD), a text-conditioned image-to-video diffusion approach enhanced with reinforcement learning to improve temporal consistency via inter-frame conditional guidance.
  • A multi-objective reward function jointly optimizes generated expression naturalness, facial integrity, and generative efficiency, targeting both synthesis quality and downstream recognition accuracy.
  • Experiments reportedly validate that ARGen boosts both generation fidelity and recognition performance, offering a generally applicable, interpretable generative augmentation paradigm for affective/vision-based emotion perception.
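The multi-objective reward mentioned above can be illustrated with a minimal sketch. The paper states only that naturalness, facial integrity, and generative efficiency are jointly optimized; the scalarization scheme, score ranges, and weights below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a multi-objective reward in the spirit of ARGen's ARD
# stage. Each input is assumed to be a per-video score in [0, 1] produced by a
# separate scorer (e.g. a naturalness critic, a face-integrity checker, an
# efficiency measure); the weighted sum and the weights are assumptions.

def multi_objective_reward(naturalness: float,
                           integrity: float,
                           efficiency: float,
                           weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Combine per-objective scores into one scalar RL reward."""
    w_nat, w_int, w_eff = weights
    return w_nat * naturalness + w_int * integrity + w_eff * efficiency
```

A weighted sum is the simplest scalarization; in practice the trade-off between synthesis quality and efficiency would depend on how each score is normalized.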

Abstract

Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.
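The ASI stage's retrieval-augmented prompt generation can be sketched at a high level. The Action Unit names follow the standard FACS coding, but the lookup table, the prompt template, and the function below are illustrative assumptions; the paper retrieves descriptions with large vision-language models rather than a fixed dictionary.

```python
# Hypothetical sketch of AU-grounded affective prompt construction in the
# spirit of ARGen's ASI stage. The AU-to-cue table and the prompt wording are
# assumptions for illustration only.

AU_DESCRIPTIONS = {
    "AU4": "brow lowerer",
    "AU6": "cheek raiser",
    "AU12": "lip corner puller",
    "AU15": "lip corner depressor",
}

def build_affective_prompt(detected_aus: list[str], emotion_label: str) -> str:
    """Turn detected Action Units into a fine-grained text condition
    for a text-conditioned image-to-video diffusion model."""
    cues = [AU_DESCRIPTIONS[au] for au in detected_aus if au in AU_DESCRIPTIONS]
    return (f"A face expressing {emotion_label}, with " + ", ".join(cues)
            + "; the expression unfolds smoothly over time.")
```

The point of such priors is interpretability: each retrieved cue names a concrete facial movement, so the text condition explains *why* a generated clip should look like a given emotion.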