A-MBER: Affective Memory Benchmark for Emotion Recognition

arXiv cs.AI / 4/10/2026


Key Points

  • A-MBER is introduced as an Affective Memory Benchmark to evaluate whether AI assistants can infer a user’s current emotional state using remembered multi-session interaction history rather than only instantaneous cues.
  • Given an interaction trajectory and a designated anchor turn, a model must infer the user's current affective state, identify the historically relevant evidence, and justify its interpretation in a grounded way.
  • It is built via a staged pipeline with explicit intermediate representations, spanning long-horizon planning, conversation generation, annotation, question construction, and final packaging, and it supports judgment, retrieval, and explanation tasks.
  • Robustness is explicitly tested through settings like modality degradation and insufficient-evidence conditions to assess how well models handle missing or degraded signals.
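  • Experiments compare local-context, long-context, retrieved-memory, structured-memory, and gold-evidence conditions, and find that A-MBER is most discriminative on the subsets it is designed to stress: long-range implicit affect, high-dependency memory levels, trajectory-based reasoning, and adversarial settings.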
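
To make the task structure concrete, here is a minimal Python sketch of what a single benchmark item and a retrieval-task score might look like. The summary does not give the actual A-MBER schema or metrics, so every name here (`Turn`, `AffectItem`, `evidence_scores`, the field names) is a hypothetical illustration, and precision/recall over evidence turns is an assumed scoring choice rather than the benchmark's documented one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Turn:
    """One utterance in a multi-session trajectory (hypothetical schema)."""
    session: int   # which session the turn belongs to
    index: int     # position of the turn in the whole trajectory
    speaker: str   # e.g. "user" or "assistant"
    text: str


@dataclass
class AffectItem:
    """One benchmark question: a trajectory, an anchor turn, and gold annotations."""
    trajectory: list[Turn]
    anchor: int                       # index of the anchor turn to interpret
    gold_label: str                   # assumed: a single affect label
    gold_evidence: set[int] = field(default_factory=set)  # earlier turns that matter


def evidence_scores(item: AffectItem, predicted: set[int]) -> tuple[float, float]:
    """Return (precision, recall) of predicted evidence turn indices.

    Only turns before the anchor can count as remembered evidence,
    so later or out-of-range indices are discarded first.
    """
    valid = {i for i in predicted if 0 <= i < item.anchor}
    hits = len(valid & item.gold_evidence)
    precision = hits / len(valid) if valid else 0.0
    recall = hits / len(item.gold_evidence) if item.gold_evidence else 0.0
    return precision, recall
```

Under this sketch, a judgment task would compare a model's label with `gold_label`, a retrieval task would call `evidence_scores`, and an explanation task would need a separate rubric or judge that is not modeled here.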

Abstract

AI assistants that interact with users over time need to interpret the user's current emotional state in order to respond appropriately and personally. However, this capability remains insufficiently evaluated. Existing emotion datasets mainly assess local or instantaneous affect, while long-term memory benchmarks focus largely on factual recall, temporal consistency, or knowledge updating. As a result, current resources provide limited support for testing whether a model can use remembered interaction history to interpret a user's present affective state. We introduce A-MBER, an Affective Memory Benchmark for Emotion Recognition, to evaluate this capability. A-MBER focuses on present affective interpretation grounded in remembered multi-session interaction history. Given an interaction trajectory and a designated anchor turn, a model must infer the user's current affective state, identify historically relevant evidence, and justify its interpretation in a grounded way. The benchmark is constructed through a staged pipeline with explicit intermediate representations, including long-horizon planning, conversation generation, annotation, question construction, and final packaging. It supports judgment, retrieval, and explanation tasks, together with robustness settings such as modality degradation and insufficient-evidence conditions. Experiments compare local-context, long-context, retrieved-memory, structured-memory, and gold-evidence conditions within a unified framework. Results show that A-MBER is especially discriminative on the subsets it is designed to stress, including long-range implicit affect, high-dependency memory levels, trajectory-based reasoning, and adversarial settings. These findings suggest that memory supports affective interpretation not simply by providing more history, but by enabling more selective, grounded, and context-sensitive use of past interaction.
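
The five evaluation conditions named in the abstract differ mainly in which parts of the history a model sees alongside the anchor turn. The following Python sketch illustrates one way such conditions could be assembled. It is an assumption-laden toy: the function `build_context`, its parameters (`window`, `top_k`, `gold`, `notes`), and the word-overlap retriever are hypothetical stand-ins, since the abstract does not describe how the paper implements retrieval or structured memory.

```python
def build_context(turns, anchor, condition, window=2, top_k=2, gold=(), notes=()):
    """Select the text a model would see for one evaluation condition.

    turns:  list of utterance strings in chronological order
    anchor: index of the anchor turn whose affect is being interpreted
    gold:   indices of annotated evidence turns (assumed to be available)
    notes:  pre-computed memory summaries (assumed format: plain strings)
    """
    history, current = turns[:anchor], turns[anchor]

    if condition == "local":
        # Only the last few turns before the anchor.
        return history[max(0, anchor - window):] + [current]

    if condition == "long":
        # The entire preceding history, unfiltered.
        return history + [current]

    if condition == "retrieved":
        # Toy retriever: rank earlier turns by word overlap with the anchor.
        query = set(current.lower().split())
        ranked = sorted(
            range(anchor),
            key=lambda i: (-len(query & set(history[i].lower().split())), i),
        )
        # Keep the top_k hits, restored to chronological order.
        return [history[i] for i in sorted(ranked[:top_k])] + [current]

    if condition == "structured":
        # Replace raw history with structured memory notes.
        return list(notes) + [current]

    if condition == "gold":
        # Oracle setting: only the annotated evidence turns.
        return [history[i] for i in sorted(gold) if i < anchor] + [current]

    raise ValueError(f"unknown condition: {condition}")
```

Comparing a model's accuracy across these contexts on the same items is what lets a benchmark separate "more history" from "better-selected history": the long condition supplies everything, while the retrieved, structured, and gold conditions supply progressively more targeted evidence.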