Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

arXiv cs.AI / 4/27/2026

Key Points

  • The study reports a large-scale audit of how well frontier LLMs can natively sample from specified probability distributions, testing 11 models across 15 distributions.
  • It finds a strong protocol asymmetry: batch generation yields only modest validity (median 7% pass rate), while independent stateless requests collapse almost entirely (10 of 11 models pass none of the distributions); see the sketch after this list.
  • Sampling fidelity degrades monotonically as distributional complexity increases and as the sampling horizon N grows.
  • The authors show downstream impacts, including systematic biases in multiple-choice question generation (failing position-uniformity constraints) and demographic target violations in attribute-constrained text-to-image prompting.
  • The paper concludes that current LLMs do not provide a reliable internal probabilistic sampler, so applications needing statistical guarantees should rely on external sampling tools or methods.

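The dual-protocol audit can be approximated with a small goodness-of-fit harness. Below is a minimal sketch, assuming a fair six-sided die as the target distribution, a chi-square test at α = 0.05 as the pass criterion, and a placeholder `query_llm` for the model API; none of these specifics come from the paper.

```python
from scipy.stats import chisquare

ALPHA = 0.05          # assumed significance level for a "pass"
N = 1000              # sampling horizon used in the paper
FACES = range(1, 7)   # target: uniform over a fair six-sided die

def query_llm(prompt: str) -> str:
    """Placeholder for a real model API call; returns the text reply."""
    raise NotImplementedError

def batch_protocol() -> list[int]:
    """Batch Generation: one response containing all N samples."""
    reply = query_llm(f"Roll a fair six-sided die {N} times. "
                      "Output only the outcomes, comma-separated.")
    # Real code should validate tokens before converting.
    return [int(tok) for tok in reply.split(",")]

def independent_protocol() -> list[int]:
    """Independent Requests: N stateless calls, one sample each."""
    return [int(query_llm("Roll a fair six-sided die once. "
                          "Output only the outcome."))
            for _ in range(N)]

def passes_target(samples: list[int]) -> bool:
    """Chi-square goodness-of-fit against the uniform target."""
    observed = [samples.count(face) for face in FACES]
    _, p_value = chisquare(observed)  # expected frequencies default to uniform
    return p_value >= ALPHA
```

A distribution counts as a pass when the test fails to reject the target at the 5% level; running the same check under both protocols mirrors the pass-rate framing used in the key points above.
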
Abstract

As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces N=1000 samples within one response, and Independent Requests, comprising N=1000 stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we show that sampling fidelity degrades monotonically with distributional complexity and worsens further as the sampling horizon N increases. Finally, we demonstrate that these failures propagate into downstream real-world tasks, introducing systematic biases: models fail to enforce uniform answer-position constraints in multiple-choice question generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.
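
The multiple-choice failure mode lends itself to the same kind of check. A minimal sketch, assuming the correct-answer letters have already been extracted from the generated questions into a list (`answer_keys` is a hypothetical input, not from the paper):

```python
from collections import Counter
from scipy.stats import chisquare

def position_uniformity_pvalue(answer_keys: list[str],
                               options: str = "ABCD") -> float:
    """p-value of a chi-square test that correct-answer positions are uniform.

    A small p-value signals systematic position bias, e.g. the model
    placing the correct answer at one letter far more often than chance.
    """
    counts = Counter(answer_keys)
    observed = [counts.get(letter, 0) for letter in options]
    _, p_value = chisquare(observed)
    return p_value
```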
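
The concluding recommendation, delegating randomness to an external sampler, is simple to apply in practice: draw values with a conventional RNG and hand the model only deterministic work. A minimal sketch of that split, with the prompt wording and numpy usage as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# The external tool does the sampling, with real statistical guarantees...
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# ...and the model is asked only to describe the fixed values,
# never to generate them.
prompt = ("Summarize this sample in one paragraph: "
          f"n={samples.size}, mean={samples.mean():.3f}, "
          f"std={samples.std():.3f}")
```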