Modeling Induced Pleasure through Cognitive Appraisal Prediction via Multimodal Fusion

arXiv cs.AI / April 28, 2026

📰 News · Models & Research

Key Points

  • The paper studies how visual content in video influences a viewer’s cognitive appraisal and produces specific affective experiences like pleasure, addressing a gap in multimodal affective computing.
  • It proposes a new computational model that predicts video-induced pleasure by estimating cognitive appraisal variables, aiming to bridge the semantic gap between “positive emotions” and “pleasure.”
  • The method tackles practical research challenges, including noisy and inconsistent human labels, the scarcity of pleasure-specific datasets, and the limited interpretability of existing black-box multimodal fusion approaches.
  • Using transformer-based multimodal feature extraction with attention and an interpretable fusion design, the model captures both the inter- and intra-modal dynamics relevant to pleasure (see the sketch after this list).
  • Experiments report a peak accuracy of 0.6624 for predicting pleasure levels, and the results suggest potential for affective content recommendation and more explainable intelligent media creation.
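
The summary gives no implementation details, so the following PyTorch sketch is a rough, hypothetical illustration of the kind of attention-based fusion the key points describe: self-attention within each modality (intra-modal dynamics), cross-attention between modalities (inter-modal dynamics), and a head that predicts appraisal variables before mapping them to a pleasure level. All module names, dimensions, the two-modality setup, and the appraisal dimensions are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of attention-based multimodal fusion (not the paper's code).
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Self-attention for intra-modal dynamics, cross-attention for inter-modal."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, context):
        # Intra-modal: tokens of x attend to each other.
        h, _ = self.self_attn(x, x, x)
        x = self.norm1(x + h)
        # Inter-modal: tokens of x attend to the other modality's tokens.
        h, _ = self.cross_attn(x, context, context)
        return self.norm2(x + h)

class PleasureModel(nn.Module):
    """Predicts appraisal variables first, then a pleasure level from them."""
    def __init__(self, dim: int = 256, n_appraisal: int = 4, n_levels: int = 3):
        super().__init__()
        self.vis_block = FusionBlock(dim)
        self.aud_block = FusionBlock(dim)
        # n_appraisal assumed dimensions, e.g. novelty, goal congruence, ...
        self.appraisal_head = nn.Linear(2 * dim, n_appraisal)
        self.pleasure_head = nn.Linear(n_appraisal, n_levels)

    def forward(self, vis, aud):
        v = self.vis_block(vis, aud).mean(dim=1)  # pool over tokens
        a = self.aud_block(aud, vis).mean(dim=1)
        appraisal = self.appraisal_head(torch.cat([v, a], dim=-1))
        return appraisal, self.pleasure_head(appraisal)

# Toy usage with random pre-extracted features (batch=2, 16 tokens, dim=256).
model = PleasureModel()
appraisal, pleasure_logits = model(torch.randn(2, 16, 256), torch.randn(2, 16, 256))
print(appraisal.shape, pleasure_logits.shape)  # torch.Size([2, 4]) torch.Size([2, 3])
```

Routing the pleasure prediction through the explicit appraisal vector, rather than straight from fused features, is what lets an architecture of this shape expose why a clip is predicted as pleasurable.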

Abstract

Multimodal affective computing analyzes user-generated social media content to predict emotional states. However, a critical gap remains in understanding how visual content shapes cognitive interpretations and elicits specific affective experiences such as pleasure. This study introduces a novel computational model to infer video-induced pleasure via cognitive appraisal variables. The proposed model addresses four challenges: (1) noisy and inconsistent human labels, (2) the semantic gap between "positive emotions" and "pleasure," (3) the scarcity of pleasure-specific datasets, and (4) the limited interpretability of existing black-box fusion methods. Our approach integrates data-driven and cognitive theory-driven methods, using cognitive appraisal theory and a fuzzy model within an innovative framework. The model employs transformer-based architectures and attention mechanisms for fine-grained multimodal feature extraction and interpretable fusion to capture both inter- and intra-modal dynamics associated with pleasure. This enables the prediction of underlying appraisal variables, thereby bridging the semantic gap and enhancing model explainability beyond conventional statistical associations. Experimental results validate the efficacy of the proposed method in detecting video-induced pleasure, achieving a peak accuracy of 0.6624 in predicting pleasure levels. These findings highlight promising implications for affective content recommendation, intelligent media creation, and advancing our understanding of how digital media influences human emotions.
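
The abstract pairs cognitive appraisal theory with a fuzzy model but does not specify it. One plausible reading, sketched below, is that predicted appraisal scores are fuzzified and combined by simple rules into a coarse pleasure level. The appraisal variables (goal congruence, novelty), the membership functions, and every rule here are invented for illustration and are not the paper's model.

```python
# Illustrative fuzzy mapping from appraisal scores to a pleasure level.

def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def pleasure_level(goal_congruence: float, novelty: float) -> str:
    """Map two appraisal scores in [0, 1] to a coarse pleasure level."""
    low_g  = tri(goal_congruence, -0.5, 0.0, 0.5)
    high_g = tri(goal_congruence,  0.5, 1.0, 1.5)
    low_n  = tri(novelty, -0.5, 0.0, 0.5)
    high_n = tri(novelty,  0.5, 1.0, 1.5)

    # Rule strengths via min (fuzzy AND); defuzzify by taking the strongest rule.
    rules = {
        "high":   min(high_g, high_n),  # congruent and novel -> strong pleasure
        "medium": min(high_g, low_n),   # congruent but familiar -> mild pleasure
        "low":    low_g,                # goal-incongruent -> little pleasure
    }
    return max(rules, key=rules.get)

print(pleasure_level(0.9, 0.8))  # -> "high"
print(pleasure_level(0.2, 0.9))  # -> "low"
```

A rule table of this kind is what would give the fused decision its interpretability: each predicted level traces back to named appraisal conditions rather than to opaque feature weights, consistent with the explainability claim in the abstract.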