We ran matched-pair experiments across GPT-4o, GPT-5.2 Thinking, and Claude Haiku 4.5. Each experiment presented two versions of the same fictional study: one reporting a statistically significant positive result, the other reporting a null result. Evidence quality, sample size, and methodology were held constant; only the direction of the conclusion changed.
Results: Models allocated less probability mass to null claims than to matched positive claims in 23 of 24 pair-condition cells. Gaps ranged from 19.6 to 56.7 percentage points across six model-format conditions. All bootstrap 95% CIs excluded zero.
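For anyone who wants to sanity-check the CI claim: the intervals come from a bootstrap over matched-pair gaps. The sketch below is a simplified percentile-bootstrap illustration, not our analysis code; the function name, resample count, and the probabilities in the example are made-up placeholders, not our data.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_gap_ci(p_positive, p_null, n_boot=10_000, alpha=0.05):
        """Percentile-bootstrap CI for the mean gap (positive minus null)
        in probability assigned to the claim, resampling matched pairs."""
        gaps = np.asarray(p_positive, float) - np.asarray(p_null, float)
        idx = rng.integers(0, len(gaps), size=(n_boot, len(gaps)))
        boot_means = gaps[idx].mean(axis=1)          # mean gap per resample
        lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
        return gaps.mean(), (lo, hi)

    # Toy numbers for illustration only -- not the paper's data.
    mean_gap, ci = bootstrap_gap_ci([0.82, 0.75, 0.90, 0.68],
                                    [0.41, 0.38, 0.52, 0.30])
    print(f"mean gap = {mean_gap:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")

If the interval stays above zero under resampling, the gap for that cell isn't an artifact of a few lopsided pairs.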
This held across 4 stimulus domains (pharmacology, education, environmental science, economics), 2 response formats (structured JSON and free-form), and 3 model families.
The asymmetry persisted even when discrete classification labels collapsed entirely. GPT-5.2 stopped assigning distinct labels to positive vs. null results, but its underlying probability allocations still showed the same directional pattern. The bias moved from surface to substrate.
We call this the asymmetric burden of proof: models treat non-detection as more provisional than matched detection, even when the underlying evidence is identical.
Why it matters: LLMs are increasingly used for evidence synthesis, literature review, safety assessment, and clinical decision support. If they systematically discount well-designed null findings, they amplify publication bias rather than correct it.
Methodology note: We used a twin-environment simulation. Each positive vignette had an exact null-result mirror. Prompts were interleaved across conditions. Full methods, stimuli, and raw data in the paper.
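For a sense of how the twin pairing and interleaving work, here is a minimal sketch. The template text, field names, and IDs are placeholders for illustration, not our actual stimuli or pipeline code.

    import random

    # Illustrative twin-pair construction: each pair is identical except for
    # the single sentence stating the conclusion direction.
    TEMPLATE = ("A double-blind RCT (n = 480) tested drug X against placebo "
                "for migraine frequency over 12 weeks. {conclusion}")

    def make_twin_pair(pair_id):
        return [
            {"pair_id": pair_id, "framing": "positive",
             "text": TEMPLATE.format(conclusion="Treatment significantly "
                                     "reduced migraine days (p = 0.003).")},
            {"pair_id": pair_id, "framing": "null",
             "text": TEMPLATE.format(conclusion="Treatment showed no "
                                     "significant effect on migraine days "
                                     "(p = 0.41).")},
        ]

    # Interleave framings across the run so presentation order can't
    # confound the positive-vs-null comparison.
    stimuli = [v for pid in ("pair_01", "pair_02") for v in make_twin_pair(pid)]
    random.Random(0).shuffle(stimuli)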
Looking for methodological critique, particularly: (1) whether the twin-environment design introduces confounds, (2) whether the pattern is sensitive to sampling temperature, (3) whether there are prior findings we should be referencing.
