AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

arXiv cs.CV / 4/8/2026

Key Points

  • The paper introduces AICA-Bench to evaluate Vision-Language Models (VLMs) on holistic Affective Image Content Analysis across three tasks: Emotion Understanding, Emotion Reasoning, and Emotion-Guided Content Generation.
  • Experiments across 23 VLMs reveal two key weaknesses: poor calibration of emotion intensity and shallow, superficial open-ended emotional descriptions.
  • To mitigate these issues, the authors propose Grounded Affective Tree (GAT) Prompting, a training-free approach that uses visual scaffolding and hierarchical reasoning.
  • Results indicate that GAT reduces emotion intensity errors and improves the depth of open-ended descriptions and generated content, providing a strong baseline for future affective multimodal research.

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.
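As a rough illustration of how a training-free, two-stage prompt in the spirit of GAT might be structured, the sketch below first elicits grounded visual evidence (the "scaffolding" step) and then reasons hierarchically from coarse emotion category to fine-grained emotion and calibrated intensity. The `query_vlm` helper, the prompt wording, and the 1-10 intensity scale are all assumptions made for illustration; the paper's actual prompts and scaffolding details may differ.

```python
# Illustrative sketch of a GAT-style two-stage prompting pipeline.
# NOTE: an assumption-based reconstruction, not the authors' released code.
# `query_vlm` is a hypothetical placeholder for any chat-capable VLM API.

from typing import Dict, List


def query_vlm(messages: List[Dict[str, str]], image_path: str) -> str:
    """Placeholder: send `messages` plus the image to a VLM and return its reply."""
    raise NotImplementedError("Wire this to your VLM client of choice.")


def gat_prompt(image_path: str) -> str:
    # Stage 1 (visual scaffolding): make the model list concrete,
    # image-grounded cues before committing to any emotion label.
    evidence = query_vlm(
        [{"role": "user",
          "content": ("List the visible cues in this image that could carry "
                      "emotional meaning: facial expressions, body language, "
                      "objects, colors, and scene context. "
                      "Do not name an emotion yet.")}],
        image_path,
    )

    # Stage 2 (hierarchical reasoning): reason coarse-to-fine,
    # category -> specific emotion -> calibrated intensity,
    # conditioned only on the evidence produced in stage 1.
    answer = query_vlm(
        [{"role": "user",
          "content": (f"Grounded evidence:\n{evidence}\n\n"
                      "Step 1: choose the coarse emotion category "
                      "(positive / negative / neutral).\n"
                      "Step 2: pick the most specific fitting emotion.\n"
                      "Step 3: rate its intensity on a 1-10 scale, "
                      "justifying the rating with the evidence above.")}],
        image_path,
    )
    return answer
```

Separating evidence gathering from labeling is what gives a scheme like this its calibration benefit in principle: the intensity rating in stage 2 must be justified against the cues enumerated in stage 1 rather than produced in a single free-form pass.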