When simulations look right but causal effects go wrong: Large language models as behavioral simulators

arXiv cs.AI / 4/6/2026


Key Points

  • The paper evaluates three large language models as behavioral simulators for 11 climate-psychology interventions on a dataset of 59,508 participants from 62 countries, and replicates the main analysis in two additional datasets (12 and 27 countries).
  • While the models can match descriptive attitudinal patterns (e.g., beliefs and policy support) and prompting can improve this fit, that descriptive accuracy often fails to produce reliable causal estimates of intervention effects.
  • The study finds a consistent “descriptive–causal divergence”: descriptive fit and causal fidelity follow different error structures and do not necessarily align (see the sketch after this list).
  • Causal errors are larger for interventions that depend on evoking internal experiences, and the mismatch is more pronounced for behavioral outcomes, where the models impose a tighter attitude–behavior coupling than appears in human data.
  • The authors warn that using descriptive fit alone can create unwarranted confidence, potentially misleading conclusions about causal intervention impacts and obscuring fairness-related population disparities.
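
To make the divergence concrete, here is a minimal sketch, not the paper's actual metrics (function names and numbers are illustrative): a simulator can match the pooled outcome mean exactly while missing the treatment effect entirely.

```python
import numpy as np

def descriptive_error(sim: np.ndarray, human: np.ndarray) -> float:
    """Gap between simulated and observed mean outcomes, pooled over conditions."""
    return abs(sim.mean() - human.mean())

def causal_error(sim_treat, sim_ctrl, human_treat, human_ctrl) -> float:
    """Gap between simulated and observed average treatment effects (ATEs)."""
    ate_sim = np.mean(sim_treat) - np.mean(sim_ctrl)
    ate_human = np.mean(human_treat) - np.mean(human_ctrl)
    return abs(ate_sim - ate_human)

# Toy numbers: humans shift from 3.0 to 3.4 under the intervention (ATE = 0.4);
# the simulator outputs 3.2 in both conditions, matching the pooled mean exactly.
human_ctrl, human_treat = np.full(100, 3.0), np.full(100, 3.4)
sim_ctrl, sim_treat = np.full(100, 3.2), np.full(100, 3.2)

print(descriptive_error(np.r_[sim_ctrl, sim_treat], np.r_[human_ctrl, human_treat]))  # 0.0
print(causal_error(sim_treat, sim_ctrl, human_treat, human_ctrl))                     # 0.4
```

A descriptive error of zero here says nothing about causal fidelity, which is the paper's core warning.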

Abstract

Behavioral simulation is increasingly used to anticipate responses to interventions. Large language models (LLMs) enable researchers to specify population characteristics and intervention context in natural language, but it remains unclear to what extent LLMs can use these inputs to infer intervention effects. We evaluated three LLMs on 11 climate-psychology interventions using a dataset of 59,508 participants from 62 countries, and replicated the main analysis in two additional datasets (12 and 27 countries). LLMs reproduced observed patterns in attitudinal outcomes (e.g., climate beliefs and policy support) reasonably well, and prompting refinements improved this descriptive fit. However, descriptive fit did not reliably translate into causal fidelity (i.e., accurate estimates of intervention effects), and these two dimensions of accuracy followed different error structures. This descriptive–causal divergence held across the three datasets but varied across intervention logics, with larger errors for interventions that depended on evoking internal experience rather than on directly conveying reasons or social cues. The divergence was more pronounced for behavioral outcomes, where LLMs imposed stronger attitude–behavior coupling than in human data. Countries and population groups that appeared well captured descriptively were not necessarily those with lower causal errors. Relying on descriptive fit alone may therefore create unwarranted confidence in simulation results, distorting conclusions about intervention effects and masking population disparities that matter for fairness.
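
The attitude–behavior coupling point can be illustrated with a small synthetic sketch; the coefficients below are invented for illustration, not taken from the paper. When a simulator makes behavior track attitude too tightly, the simulated attitude–behavior correlation overshoots the human one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical human data: attitudes only weakly predict behavior.
att_h = rng.normal(size=n)
beh_h = 0.3 * att_h + rng.normal(scale=1.0, size=n)

# Hypothetical simulated data: behavior is nearly a deterministic function of attitude.
att_s = rng.normal(size=n)
beh_s = 0.9 * att_s + rng.normal(scale=0.4, size=n)

print(np.corrcoef(att_h, beh_h)[0, 1])  # ~0.29: weak human coupling
print(np.corrcoef(att_s, beh_s)[0, 1])  # ~0.91: overly tight simulated coupling
```

Under such tight coupling, any attitudinal shift the simulator produces is mechanically inherited by behavior, which inflates behavioral effect estimates relative to human data.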