TEMPER: Testing Emotional Perturbation in Quantitative Reasoning
arXiv cs.CL / 4/10/2026
Key Points
- The paper introduces TEMPER, a controlled emotion-translation framework that rewrites quantitative reasoning questions into emotional variants while preserving all numerical quantities and relationships.
- Using TEMPER, the authors build Temper-5400 (5,400 verified emotion–neutral pairs) spanning GSM8K, MultiArith, and ARC-Challenge, and evaluate it across 18 language models ranging from ~1B parameters to frontier scale.
- They find that emotional framing alone can reduce quantitative reasoning accuracy by 2–10 percentage points even when numerical content is unchanged.
- Neutralizing the emotional variants at inference time recovers most of the lost accuracy, indicating the degradation comes from emotional style rather than content corruption.
- The authors argue the benchmark-building method generalizes beyond emotion, enabling broader robustness testing via controlled stylistic translation.
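
To make the key points concrete, here is a minimal Python sketch of two ideas the summary describes: checking that an emotional rewrite keeps the same numbers as its neutral source, and measuring the paired accuracy drop in percentage points. It is not the paper's verification or evaluation procedure. The names `extract_numbers`, `preserves_numbers`, `PairResult`, and `accuracy_drop_pp` are hypothetical, and the regex-based check is an assumed simplification that only compares digit-form numbers, not spelled-out quantities or the relationships between them.

```python
import re
from dataclasses import dataclass

# Assumption: numbers appear in digit form, optionally with thousands separators.
THOUSANDS_RE = re.compile(r"(?<=\d),(?=\d{3}\b)")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text: str) -> list[str]:
    """Return the sorted multiset of digit-form numbers found in text."""
    normalized = THOUSANDS_RE.sub("", text)  # "1,200" -> "1200"
    return sorted(NUMBER_RE.findall(normalized))


def preserves_numbers(neutral: str, emotional: str) -> bool:
    """True if both variants contain exactly the same numbers (as a multiset)."""
    return extract_numbers(neutral) == extract_numbers(emotional)


@dataclass
class PairResult:
    """Correctness of one model on a neutral question and its emotional variant."""
    neutral_correct: bool
    emotional_correct: bool


def accuracy_drop_pp(results: list[PairResult]) -> float:
    """Neutral accuracy minus emotional accuracy, in percentage points."""
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    neutral_acc = sum(r.neutral_correct for r in results) / n
    emotional_acc = sum(r.emotional_correct for r in results) / n
    return 100.0 * (neutral_acc - emotional_acc)


if __name__ == "__main__":
    neutral = "Sam has 12 apples and buys 1,200 more. How many does he have?"
    emotional = (
        "Sam, heartbroken and anxious, clutches his 12 apples and frantically "
        "buys 1,200 more. How many does he have?"
    )
    print(preserves_numbers(neutral, emotional))

    # Illustrative, made-up outcomes for four pairs (not results from the paper).
    toy = [
        PairResult(True, True),
        PairResult(True, False),
        PairResult(True, True),
        PairResult(False, False),
    ]
    print(accuracy_drop_pp(toy))
```

Because each emotional variant is paired with its neutral source, the drop is computed over the same questions, so any difference reflects the framing rather than a change in problem difficulty.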



