Variational Encoder--Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition

arXiv cs.AI / 4/6/2026

Key Points

  • The paper introduces VE-MD (Variational Encoder–Multi-Decoder) for Group Emotion Recognition (GER) that is designed to reduce privacy risks by avoiding person-centric outputs like identity or per-person emotion estimates.
  • Instead of formal anonymization, VE-MD constrains learning to predict only aggregate group-level affect while jointly learning shared latent representations with internal structural decoding (body and facial structure).
  • Two structural decoding approaches are evaluated—a transformer-based PersonQuery decoder and a dense heatmap decoder—where the heatmap method more naturally supports variable group sizes.
  • Experiments across six in-the-wild datasets show that structural supervision improves representation learning, and the study finds a key behavioral difference: GER benefits from preserving interaction-related structural cues, whereas Individual Emotion Recognition (IER) can be helped by structural representations acting as a denoising bottleneck.
  • The method reports state-of-the-art results on GER benchmarks (e.g., GAF-3.0 up to 90.06% and VGAF up to 82.25% with audio fusion) and competitive-to-strong performance on several individual emotion benchmarks under multimodal settings.
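To make the design concrete, here is a minimal NumPy sketch of the idea behind VE-MD: a shared variational encoder whose latent feeds two decoders, a group-emotion classifier (the only external output) and a dense structural heatmap head used purely as internal supervision. This is not the paper's implementation; all dimensions, layer shapes, and loss weights are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

# Hypothetical dimensions (not from the paper): a 64-d scene feature,
# a 16-d shared latent, 3 group-emotion classes, an 8x8 structural heatmap.
D_IN, D_LAT, N_CLS, H = 64, 16, 3, 8

# Shared variational encoder: predicts mean and log-variance of the latent.
W_mu, b_mu = rng.normal(size=(D_IN, D_LAT)) * 0.1, np.zeros(D_LAT)
W_lv, b_lv = rng.normal(size=(D_IN, D_LAT)) * 0.1, np.zeros(D_LAT)

# Decoder 1: aggregate group-emotion logits (the only external output).
W_cls, b_cls = rng.normal(size=(D_LAT, N_CLS)) * 0.1, np.zeros(N_CLS)

# Decoder 2: dense structural heatmap; a fixed-size map sidesteps
# per-person slots, so any number of people fits the same output.
W_hm, b_hm = rng.normal(size=(D_LAT, H * H)) * 0.1, np.zeros(H * H)

def forward(x):
    mu = linear(x, W_mu, b_mu)
    logvar = linear(x, W_lv, b_lv)
    # Reparameterization trick: sample z while keeping mu/logvar differentiable.
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    logits = linear(z, W_cls, b_cls)                    # group-level affect
    heatmap = linear(z, W_hm, b_hm).reshape(-1, H, H)   # structural target
    return logits, heatmap, mu, logvar

def loss(logits, heatmap, mu, logvar, y, hm_target):
    # Cross-entropy on the aggregate group emotion.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    ce = -np.log(p[np.arange(len(y)), y]).mean()
    # MSE on the heatmap decoding (internal structural supervision).
    mse = ((heatmap - hm_target) ** 2).mean()
    # KL divergence of the latent to a standard normal prior.
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1 - logvar).sum(axis=1).mean()
    return ce + mse + 0.01 * kl

x = rng.normal(size=(4, D_IN))      # a batch of 4 scene features
y = np.array([0, 1, 2, 1])          # group-emotion labels
hm = rng.random(size=(4, H, H))     # stand-in structural heatmaps
logits, heatmap, mu, logvar = forward(x)
total = loss(logits, heatmap, mu, logvar, y, hm)
```

Note how only the classification head produces a deployable output; the heatmap head exists to shape the shared latent, which is the "privacy by functional design" constraint the paper describes.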

Abstract

Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed. This research proposes VE-MD, a Variational Encoder–Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs. VE-MD learns a shared latent representation jointly optimized for emotion classification and internal prediction of body and facial structural representations. Two structural decoding strategies are investigated: a transformer-based PersonQuery decoder and a dense heatmap decoder that naturally accommodates variable group sizes. Experiments on six in-the-wild datasets, including two GER and four Individual Emotion Recognition (IER) benchmarks, show that structural supervision consistently improves representation learning. More importantly, the results reveal a clear distinction between GER and IER: optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs improves collective affect inference. In contrast, projected structural representations appear to act as an effective denoising bottleneck for IER. VE-MD achieves state-of-the-art performance on GAF-3.0 (up to 90.06%) and VGAF (82.25% with multimodal audio fusion).
These results show that preserving interaction-related structural information is particularly beneficial for group-level affect modeling without relying on prior individual feature extraction. On IER datasets with multimodal audio fusion, VE-MD surpasses the state of the art on SamSemo (77.9%, with an added text modality) while achieving competitive performance on MER-MULTI (63.8%), DFEW (70.7%), and EngageNet (69.0%).