When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

arXiv cs.LG / 4/15/2026


Key Points

  • The study points out that, in multi-agent negotiation and policy simulations, the common assumption that stronger reasoning yields higher simulation fidelity does not always hold.
  • When the objective is not to solve a strategic problem but to sample boundedly rational behavior, reasoning-enhanced models can over-optimize for strategically dominant actions, collapsing compromise-oriented terminal behavior.
  • The authors analyze this solver-sampler mismatch (models become stronger as solvers but worse as samplers of behavior) across three negotiation and electricity-management environments.
  • Comparing reflection conditions, they show that bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning.
  • In additional validation with OpenAI's GPT-4.1 and GPT-5.2, GPT-5.2 with native reasoning tends to end in authority decisions, while with bounded reflection it recovers compromise outcomes in every environment.

Abstract

Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem, but to sample plausible boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions (no reflection, bounded reflection, and native reasoning) across two primary model families, and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 native ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 bounded recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.
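The closing point, that models should be qualified as samplers rather than only as solvers, suggests evaluating a batch of simulation runs by its outcome distribution rather than by any single trajectory. A minimal sketch of such sampler-level metrics is below; the outcome labels, counts, and the `sampler_metrics` helper are hypothetical illustrations, not the paper's actual evaluation code.

```python
from collections import Counter
from math import log2

def sampler_metrics(outcomes):
    """Summarize a batch of run outcomes with two sampler-quality metrics:
    the compromise rate (fraction of runs ending in 'compromise') and the
    Shannon entropy of the outcome distribution (outcome-level diversity)."""
    counts = Counter(outcomes)
    n = len(outcomes)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return {"compromise_rate": counts["compromise"] / n, "entropy": entropy}

# Hypothetical terminal outcomes per reflection condition (illustrative only):
# a collapsed, solver-like condition vs. a more varied, compromise-oriented one.
native_runs = ["authority"] * 15
bounded_runs = ["compromise"] * 9 + ["authority"] * 4 + ["escalation"] * 2

print(sampler_metrics(native_runs))   # zero entropy: no outcome diversity
print(sampler_metrics(bounded_runs))
```

A solver-style benchmark would score each run's final action in isolation; the point here is that a degenerate distribution (all runs ending in the same authority decision) can look strong per-run while failing as a sample of boundedly rational behavior.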