Comparative reversal learning reveals rigid adaptation in LLMs under non-stationary uncertainty

arXiv cs.AI / 4/7/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper studies how large language models behave as sequential decision policies in non-stationary reversal-learning tasks, where switch events are triggered by a performance criterion or a timeout (a minimal task sketch follows this list).
  • Across DeepSeek-V3.2, Gemini-3, and GPT-5.2, win-stay behavior is near ceiling while lose-shift is markedly weaker, indicating asymmetric reliance on positive versus negative evidence.
  • Models show different adaptation profiles: DeepSeek-V3.2 exhibits strong perseveration and weak acquisition after reversals, whereas Gemini-3 and GPT-5.2 adapt faster but remain less sensitive to losses than humans.
  • Introducing random transition schedules that increase volatility amplifies reversal-specific persistence without necessarily reducing overall win rates, suggesting rigid adaptation can coexist with high aggregate performance.
  • Hierarchical RL analyses suggest rigidity may stem from weak loss learning, overly deterministic policies, or value polarization due to counterfactual suppression, motivating volatility-aware evaluation diagnostics for LLMs.
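
The task structure described in the key points is straightforward to pin down in code. Below is a minimal sketch under stated assumptions: the reward probabilities per latent state, the criterion length of 8, the 30-trial timeout, and the mapping of the three latent states to (p_reward_A, p_reward_B) contingencies are all illustrative placeholders, not the paper's settings, and the names `ReversalTask` and `win_stay_lose_shift` are our own. The helper computes the two behavioural rates the bullets above refer to.

```python
import random

# Assumed contingencies: each latent state fixes (p_reward_A, p_reward_B).
REWARD_PROBS = [(0.8, 0.2), (0.2, 0.8), (0.5, 0.5)]
CRITERION = 8    # assumed consecutive-correct run that triggers a switch
TIMEOUT = 30     # assumed trial cap before a forced switch


class ReversalTask:
    """Two-option probabilistic reversal task with criterion/timeout switches."""

    def __init__(self, schedule="fixed", seed=0):
        self.rng = random.Random(seed)
        self.schedule = schedule   # "fixed" deterministic cycle vs "random" (volatile)
        self.state = 0             # index into REWARD_PROBS
        self.run = 0               # consecutive correct choices in this state
        self.trials = 0            # trials spent in this state

    def step(self, choice):
        """Play choice in {0, 1}; return (reward, switched)."""
        probs = REWARD_PROBS[self.state]
        reward = int(self.rng.random() < probs[choice])
        best = 0 if probs[0] >= probs[1] else 1   # ties break to option 0
        self.run = self.run + 1 if choice == best else 0
        self.trials += 1
        switched = self.run >= CRITERION or self.trials >= TIMEOUT
        if switched:
            self._switch()
        return reward, switched

    def _switch(self):
        if self.schedule == "fixed":
            self.state = (self.state + 1) % len(REWARD_PROBS)  # predictable cycle
        else:
            others = [s for s in range(len(REWARD_PROBS)) if s != self.state]
            self.state = self.rng.choice(others)               # unpredictable jump
        self.run, self.trials = 0, 0


def win_stay_lose_shift(choices, rewards):
    """Empirical win-stay and lose-shift rates from choice/reward histories."""
    ws = ls = wins = losses = 0
    for t in range(1, len(choices)):
        if rewards[t - 1]:
            wins += 1
            ws += int(choices[t] == choices[t - 1])
        else:
            losses += 1
            ls += int(choices[t] != choices[t - 1])
    return ws / max(wins, 1), ls / max(losses, 1)


# Usage: a random baseline agent, just to exercise the task.
task = ReversalTask(schedule="random", seed=1)
agent = random.Random(2)
choices, rewards = [], []
for _ in range(300):
    a = agent.randint(0, 1)
    r, _ = task.step(a)
    choices.append(a)
    rewards.append(r)
print(win_stay_lose_shift(choices, rewards))  # ~(0.5, 0.5) for a random agent
```

A random agent is win-stay and lose-shift at chance; the paper's finding is that LLMs sit near ceiling on the first rate and well below humans on the second.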

Abstract

Non-stationary environments require agents to revise previously learned action values when contingencies change. We treat large language models (LLMs) as sequential decision policies in a two-option probabilistic reversal-learning task with three latent states and switch events triggered by either a performance criterion or timeout. We compare a deterministic fixed transition cycle to a stochastic random schedule that increases volatility, and evaluate DeepSeek-V3.2, Gemini-3, and GPT-5.2, with human data as a behavioural reference. Across models, win-stay was near ceiling while lose-shift was markedly attenuated, revealing asymmetric use of positive versus negative evidence. DeepSeek-V3.2 showed extreme perseveration after reversals and weak acquisition, whereas Gemini-3 and GPT-5.2 adapted more rapidly but still remained less loss-sensitive than humans. Random transitions amplified reversal-specific persistence across LLMs yet did not uniformly reduce total wins, demonstrating that high aggregate payoff can coexist with rigid adaptation. Hierarchical reinforcement-learning (RL) fits indicate dissociable mechanisms: rigidity can arise from weak loss learning, inflated policy determinism, or value polarisation via counterfactual suppression. These results motivate reversal-sensitive diagnostics and volatility-aware models for evaluating LLMs under non-stationary uncertainty.
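
The mechanisms named in the abstract's final sentences map onto a small family of RL observer models. The following is a hedged sketch of one generic formulation, not the paper's actual model: separate learning rates for gains and losses (a small alpha_loss captures weak loss learning), a softmax inverse temperature beta (an inflated beta yields an overly deterministic policy), and a counterfactual weight kappa (kappa near zero suppresses updating of the unchosen option, letting values polarise). All parameter names and the exact update rule are assumptions.

```python
import math
import random


def softmax_choice(q, beta, rng):
    """Sample an action with p(a) proportional to exp(beta * Q[a])."""
    m = max(q)
    weights = [math.exp(beta * (v - m)) for v in q]  # subtract max for stability
    r = rng.random() * sum(weights)
    acc = 0.0
    for a, w in enumerate(weights):
        acc += w
        if r <= acc:
            return a
    return len(q) - 1


def q_update(q, choice, reward, alpha_gain, alpha_loss, kappa):
    """One-trial value update with asymmetric learning rates and a counterfactual term."""
    delta = reward - q[choice]
    # Weak loss learning corresponds to alpha_loss << alpha_gain.
    alpha = alpha_gain if delta >= 0 else alpha_loss
    q[choice] += alpha * delta
    # Counterfactual update: nudge the unchosen option toward the opposite outcome.
    # kappa = 0 is full counterfactual suppression, which lets values polarise.
    other = 1 - choice
    q[other] += kappa * ((1 - reward) - q[other])
    return q


# Usage: a loss-insensitive, near-deterministic learner on the task sketched earlier.
rng = random.Random(0)
q = [0.5, 0.5]
choice = softmax_choice(q, beta=8.0, rng=rng)            # high beta: rigid policy
q = q_update(q, choice, reward=0, alpha_gain=0.6,
             alpha_loss=0.05, kappa=0.0)                 # barely learns from the loss
```

On this reading, hierarchically fitting parameters like these per trial and per model is what allows the three rigidity mechanisms the abstract names to be dissociated.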