AI Navigate

[R] We tested whether LLMs apply the same evidential standard to positive vs. null results: They don’t.

Reddit r/MachineLearning / 3/18/2026


Key Points

  • The authors conducted matched-pair experiments across GPT-4o, GPT-5.2 Thinking, and Claude Haiku 4.5 using identical fictional studies that differed only in concluding direction (positive vs. null).
  • They found that models allocated less probability to null conclusions than to positive ones in 23 of 24 pair-condition cells, with gaps ranging from 19.6 to 56.7 percentage points and bootstrap 95% CIs excluding zero.
  • The asymmetric burden of proof persisted across four domains, two response formats, and three model families; in GPT-5.2, the model stopped using distinct labels for positive vs. null results, yet the underlying probability allocations remained directional.
  • They term this the asymmetric burden of proof and warn it could amplify publication bias in evidence synthesis, safety assessment, and clinical decision support.

We ran matched-pair experiments across GPT-4o, GPT-5.2 Thinking, and Claude Haiku 4.5. Each experiment presented two versions of an identical fictional study: one reporting a statistically significant positive result, one reporting a null result. Evidence quality, sample size, and methodology were held constant. Only the conclusion direction changed.

Results: Models allocated less probability mass to null claims than to matched positive claims in 23 of 24 pair-condition cells. Gaps ranged from 19.6 to 56.7 percentage points across six model-format conditions. All bootstrap 95% CIs excluded zero.
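The per-pair gap statistic and its bootstrap CI are straightforward to reproduce in outline. Here is a minimal sketch with invented, illustrative gap values (not the paper's data), showing a percentile bootstrap on the mean positive-minus-null probability gap:

```python
import random

# Hypothetical per-pair gaps: probability assigned to the positive
# conclusion minus probability assigned to the matched null conclusion,
# in percentage points. Illustrative numbers only.
gaps = [31.2, 24.8, 40.1, 19.9, 55.3, 28.7,
        33.4, 22.1, 45.0, 27.6, 38.2, 30.9]

def bootstrap_ci(data, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `data`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(data, k=len(data))) / len(data)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2))]
    return lo, hi

lo, hi = bootstrap_ci(gaps)
print(f"mean gap = {sum(gaps)/len(gaps):.1f} pp, 95% CI [{lo:.1f}, {hi:.1f}]")
```

A CI excluding zero, as reported for all six model-format conditions, indicates the positive-vs-null gap is systematic rather than sampling noise.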

This held across four stimulus domains (pharmacology, education, environmental science, economics), two response formats (structured JSON and free-form), and three model families.

The asymmetry persisted even when discrete classification labels collapsed entirely. In GPT-5.2's case, the model stopped using distinct labels for positive vs. null, but the underlying probability allocations still showed the same directional pattern. The bias moved from surface to substrate.

We call this the asymmetric burden of proof: models treat non-detection as more provisional than matched detection, even when the underlying evidence is identical.

Why it matters: LLMs are increasingly used for evidence synthesis, literature review, safety assessment, and clinical decision support. If they systematically discount well-designed null findings, they amplify publication bias rather than correct it.

Methodology note: We used a twin-environment simulation. Each positive vignette had an exact null-result mirror. Prompts were interleaved across conditions. Full methods, stimuli, and raw data in the paper.
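The twin-environment construction can be sketched in a few lines. The template, field names, and example study below are illustrative assumptions, not the paper's actual stimuli; the point is that the two prompts differ only in the conclusion string:

```python
# Sketch of the twin-pair design: each fictional study is rendered
# twice, identical except for the concluding direction.
TEMPLATE = (
    "A study with n={n} participants used {method}. "
    "Result: {conclusion}. "
    "What probability do you assign to the claim '{claim}'?"
)

def make_twin(study):
    """Return (positive, null) prompt versions of one study."""
    base = dict(n=study["n"], method=study["method"], claim=study["claim"])
    pos = TEMPLATE.format(conclusion=study["positive_conclusion"], **base)
    null = TEMPLATE.format(conclusion=study["null_conclusion"], **base)
    return pos, null

studies = [{
    "n": 480,
    "method": "a preregistered randomized controlled design",
    "claim": "the drug reduces symptom severity",
    "positive_conclusion": "a statistically significant reduction (p = 0.003)",
    "null_conclusion": "no statistically significant reduction (p = 0.41)",
}]

# Interleave conditions so positive and null prompts alternate
# rather than being blocked by condition.
prompts = []
for study in studies:
    pos, null = make_twin(study)
    prompts.extend([("positive", pos), ("null", null)])

for label, p in prompts:
    print(label, "->", p)
```

Holding everything but the conclusion constant is what licenses attributing any probability gap to conclusion direction rather than to evidence quality.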

Looking for methodological critique, particularly: (1) whether the twin-environment design introduces confounds, (2) whether temperature sensitivity would change the pattern, (3) whether there are prior findings we should be referencing.

Paper: https://zenodo.org/records/18867694

submitted by /u/galigirii