We ran matched-pair experiments across GPT-4o, GPT-5.2 Thinking, and Claude Haiku 4.5. Each experiment presented two versions of the same fictional study: one reporting a statistically significant positive result, the other reporting a null result. Evidence quality, sample size, and methodology were held constant; only the direction of the conclusion changed.
Results: Models allocated less probability mass to null claims than to matched positive claims in 23 of 24 pair-condition cells. Gaps ranged from 19.6 to 56.7 percentage points across six model-format conditions. All bootstrap 95% CIs excluded zero.
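For anyone who wants to sanity-check the CI claim: the intervals come from a bootstrap over matched-pair gaps. The sketch below is a simplified percentile-bootstrap illustration, not our analysis code; the function name, resample count, and the probabilities in the example are made-up placeholders, not our data.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_gap_ci(p_positive, p_null, n_boot=10_000, alpha=0.05):
        """Percentile-bootstrap CI for the mean gap (positive minus null)
        in probability assigned to the claim, resampling matched pairs."""
        gaps = np.asarray(p_positive, float) - np.asarray(p_null, float)
        idx = rng.integers(0, len(gaps), size=(n_boot, len(gaps)))
        boot_means = gaps[idx].mean(axis=1)          # mean gap per resample
        lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
        return gaps.mean(), (lo, hi)

    # Toy numbers for illustration only -- not the paper's data.
    mean_gap, ci = bootstrap_gap_ci([0.82, 0.75, 0.90, 0.68],
                                    [0.41, 0.38, 0.52, 0.30])
    print(f"mean gap = {mean_gap:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")

If the interval stays above zero under resampling, the gap for that cell isn't an artifact of a few lopsided pairs.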
This held across 4 stimulus domains (pharmacology, education, environmental science, economics), 2 response formats (structured JSON and free-form), and 3 model families.
The asymmetry persisted even when discrete classification labels collapsed entirely. GPT-5.2 stopped assigning distinct labels to positive vs. null results, but its underlying probability allocations still showed the same directional pattern. The bias moved from surface to substrate.
We call this the asymmetric burden of proof: models treat non-detection as more provisional than matched detection, even when the underlying evidence is identical.
Why it matters: LLMs are increasingly used for evidence synthesis, literature review, safety assessment, and clinical decision support. If they systematically discount well-designed null findings, they amplify publication bias rather than correct it.
Methodology note: We used a twin-environment simulation. Each positive vignette had an exact null-result mirror. Prompts were interleaved across conditions. Full methods, stimuli, and raw data in the paper.
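For a sense of how the twin pairing and interleaving work, here is a minimal sketch. The template text, field names, and IDs are placeholders for illustration, not our actual stimuli or pipeline code.

    import random

    # Illustrative twin-pair construction: each pair is identical except for
    # the single sentence stating the conclusion direction.
    TEMPLATE = ("A double-blind RCT (n = 480) tested drug X against placebo "
                "for migraine frequency over 12 weeks. {conclusion}")

    def make_twin_pair(pair_id):
        return [
            {"pair_id": pair_id, "framing": "positive",
             "text": TEMPLATE.format(conclusion="Treatment significantly "
                                     "reduced migraine days (p = 0.003).")},
            {"pair_id": pair_id, "framing": "null",
             "text": TEMPLATE.format(conclusion="Treatment showed no "
                                     "significant effect on migraine days "
                                     "(p = 0.41).")},
        ]

    # Interleave framings across the run so presentation order can't
    # confound the positive-vs-null comparison.
    stimuli = [v for pid in ("pair_01", "pair_02") for v in make_twin_pair(pid)]
    random.Random(0).shuffle(stimuli)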
Looking for methodological critique, particularly: (1) whether the twin-environment design introduces confounds, (2) whether the pattern is sensitive to sampling temperature, (3) whether there are prior findings we should be referencing.
