TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

arXiv cs.CL / 4/10/2026

Key Points

  • The paper introduces TEMPER, a controlled emotion-translation framework that rewrites quantitative reasoning questions into emotional variants while preserving all numerical quantities and relationships.
  • Using TEMPER, the authors build Temper-5400 (5,400 verified emotion–neutral pairs) spanning GSM8K, MultiArith, and ARC-Challenge and evaluate it across 18 language models from ~1B to frontier scale.
  • They find that emotional framing alone can reduce quantitative reasoning accuracy by 2–10 percentage points even when numerical content is unchanged.
  • Neutralizing the emotional variants at inference time recovers most of the lost accuracy, indicating the degradation comes from emotional style rather than content corruption.
  • The authors argue the benchmark-building method generalizes beyond emotion, enabling broader robustness testing via controlled stylistic translation.
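The requirement that emotional variants keep every numerical quantity can be approximated with a simple surface check. The following is a minimal sketch, not the authors' verification procedure (the summary describes the pairs only as "verified"). The helper names `extract_numbers`, `preserves_quantities`, and `accuracy_drop_pp` are hypothetical. The check assumes quantities are written as digits, so it would miss numbers spelled out as words and would flag any extra digits added by the emotional wrapper.

```python
import re

# Matches integers and decimals written with digits, e.g. "12", "3.5", "1,200".
# Assumption: quantities are not spelled out as words.
_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text):
    """Return the sorted numeric literals found in `text`, as floats."""
    return sorted(float(m.replace(",", "")) for m in _NUMBER.findall(text))


def preserves_quantities(neutral, emotional):
    """True if both variants contain exactly the same multiset of numbers."""
    return extract_numbers(neutral) == extract_numbers(emotional)


def accuracy_drop_pp(neutral_correct, emotional_correct):
    """Accuracy drop in percentage points over paired per-item results.

    Both arguments are equal-length sequences of booleans, one entry per
    question, aligned so that index i refers to the same underlying problem.
    """
    if len(neutral_correct) != len(emotional_correct) or not neutral_correct:
        raise ValueError("expected two non-empty, equal-length sequences")
    n = len(neutral_correct)
    return 100.0 * (sum(neutral_correct) - sum(emotional_correct)) / n
```

A pair that fails `preserves_quantities` would need manual review rather than automatic rejection, since the heuristic looks only at surface digits and says nothing about whether the relationships between quantities are intact.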

Abstract

Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency, or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion–neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge. First, emotional framing reduces accuracy by 2–10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.
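The inference-time mitigation described in the abstract amounts to rewriting a query into neutral language before solving it. The sketch below shows that two-step shape under stated assumptions: `neutralize` and `solve` are hypothetical callables supplied by the caller (for example, two prompts to a language model), and `recovery_fraction` is an illustrative way to quantify "recovers most of the lost performance", not a metric taken from the paper.

```python
from typing import Callable


def answer_with_neutralization(
    question: str,
    neutralize: Callable[[str], str],
    solve: Callable[[str], str],
) -> str:
    """Rewrite the question into neutral wording, then solve the rewrite.

    `neutralize` and `solve` are placeholders for whatever model calls the
    caller uses; this function only fixes the order of the two steps.
    """
    return solve(neutralize(question))


def recovery_fraction(
    acc_neutral: float, acc_emotional: float, acc_neutralized: float
) -> float:
    """Share of the emotion-induced accuracy gap closed by neutralization.

    1.0 means neutralized accuracy matches the original neutral accuracy;
    0.0 means neutralization did not help at all.
    """
    gap = acc_neutral - acc_emotional
    if gap <= 0:
        raise ValueError("no accuracy gap to recover")
    return (acc_neutralized - acc_emotional) / gap
```

Keeping the rewrite and the solver as separate callables lets the same solver be evaluated on original, emotional, and neutralized inputs, which is the comparison the abstract relies on.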