What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
arXiv cs.CL / 5/5/2026
Key Points
- The paper argues that single-prompt accuracy can conceal important reliability failures, so it audits reliability using multiple prompt variants and several calibration/robustness metrics across many model-dataset combinations.
- It finds that evaluation design itself can materially change conclusions: alternative definitions of expected calibration error (ECE) produce large metric shifts, and a mismatch between chain-of-thought prompting and a first-character answer evaluator causes major apparent accuracy drops.
- The authors show that some performance losses stem from evaluator-side issues rather than from the model itself, since two independent "repair" procedures recover most of, or even more than, the lost performance.
- Confidence and verbal behavior are shown to be fragile: reported verbal confidence can be inconsistent with both accuracy and token-probability calibration, and verbal parseability may collapse for particular models and prompt variants.
- Prompt robustness is not reliably correlated with parameter count, with correlations varying in sign and magnitude across benchmarks, implying that model size alone is not a dependable proxy for reliability.
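The evaluator mismatch described above can be illustrated with a minimal, hypothetical sketch (the paper's actual evaluators are not reproduced here): a scorer that reads only the first character marks a correct chain-of-thought answer wrong, while a pattern-based extractor recovers it.

```python
# Hypothetical illustration of an evaluator-side failure: a first-character
# scorer versus a regex-based answer extractor on chain-of-thought output.
# Function names and the A-D option format are assumptions for this sketch.
import re

def first_char_eval(response: str, gold: str) -> bool:
    # Scores by the first non-whitespace character only.
    return response.strip()[:1].upper() == gold.upper()

def extract_answer_eval(response: str, gold: str) -> bool:
    # Takes the last standalone option letter (A-D) in the response.
    matches = re.findall(r"\b([A-D])\b", response.upper())
    return bool(matches) and matches[-1] == gold.upper()

cot_response = "Let's reason step by step. The capital is Paris, so the answer is B."
print(first_char_eval(cot_response, "B"))      # False: first character is "L"
print(extract_answer_eval(cot_response, "B"))  # True: recovers the final "B"
```

Under chain-of-thought prompting the model's first character is almost never the option letter, so the naive evaluator reports a large apparent accuracy drop even when the final answers are correct.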
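The sensitivity to the ECE definition can likewise be sketched: the same set of predictions yields different calibration-error values under equal-width confidence bins versus equal-mass (quantile) bins. This is an illustrative toy example with synthetic, systematically overconfident predictions, not the paper's data or metric code.

```python
# Toy sketch: the binning scheme is part of the ECE definition, so changing
# it changes the reported number for identical predictions.
import numpy as np

def ece(conf, correct, edges):
    # Weighted mean |accuracy - confidence| gap over the given bin edges.
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=float)
    bins = np.digitize(conf, edges[1:-1])  # assign each sample to a bin
    total = 0.0
    for b in range(len(edges) - 1):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
correct = rng.uniform(size=1000) < conf**2   # overconfident: accuracy < confidence

equal_width = np.linspace(0.0, 1.0, 11)                 # ten equal-width bins
equal_mass = np.quantile(conf, np.linspace(0.0, 1.0, 11))  # ten equal-mass bins
print(ece(conf, correct, equal_width), ece(conf, correct, equal_mass))
```

Equal-width binning leaves the low-confidence bins empty here, while quantile binning spreads samples evenly, so the two definitions weight the miscalibrated regions differently and report different values.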