What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
arXiv cs.CL / 5/5/2026
Key Points
- The paper argues that single-prompt accuracy can conceal important reliability failures, so it audits reliability using multiple prompt variants and several calibration/robustness metrics across many model-dataset combinations.
- It finds that evaluation design itself can materially change conclusions: alternative definitions of expected calibration error (ECE) produce large metric shifts, and a mismatch between chain-of-thought prompting and a first-character answer evaluator causes major apparent accuracy drops.
- The authors show that some performance losses stem from evaluator-side issues rather than from the model itself, since two independent "repair" procedures recover most of, or even more than, the lost performance.
- Confidence and verbal behavior are shown to be fragile: reported verbal confidence can be inconsistent with both accuracy and token-probability calibration, and verbal parseability may collapse for particular models and prompt variants.
- Prompt robustness is not reliably correlated with parameter count, with correlations varying in sign and magnitude across benchmarks, implying that model size alone is not a dependable proxy for reliability.
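The evaluator mismatch described above can be illustrated with a minimal, hypothetical sketch (the paper's actual evaluators are not reproduced here): a scorer that reads only the first character marks a correct chain-of-thought answer wrong, while a pattern-based extractor recovers it.

```python
# Hypothetical illustration of an evaluator-side failure: a first-character
# scorer versus a regex-based answer extractor on chain-of-thought output.
# Function names and the A-D option format are assumptions for this sketch.
import re

def first_char_eval(response: str, gold: str) -> bool:
    # Scores by the first non-whitespace character only.
    return response.strip()[:1].upper() == gold.upper()

def extract_answer_eval(response: str, gold: str) -> bool:
    # Takes the last standalone option letter (A-D) in the response.
    matches = re.findall(r"\b([A-D])\b", response.upper())
    return bool(matches) and matches[-1] == gold.upper()

cot_response = "Let's reason step by step. The capital is Paris, so the answer is B."
print(first_char_eval(cot_response, "B"))      # False: first character is "L"
print(extract_answer_eval(cot_response, "B"))  # True: recovers the final "B"
```

Under chain-of-thought prompting the model's first character is almost never the option letter, so the naive evaluator reports a large apparent accuracy drop even when the final answers are correct.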
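The sensitivity to the ECE definition can likewise be sketched: the same set of predictions yields different calibration-error values under equal-width confidence bins versus equal-mass (quantile) bins. This is an illustrative toy example with synthetic, systematically overconfident predictions, not the paper's data or metric code.

```python
# Toy sketch: the binning scheme is part of the ECE definition, so changing
# it changes the reported number for identical predictions.
import numpy as np

def ece(conf, correct, edges):
    # Weighted mean |accuracy - confidence| gap over the given bin edges.
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=float)
    bins = np.digitize(conf, edges[1:-1])  # assign each sample to a bin
    total = 0.0
    for b in range(len(edges) - 1):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)
correct = rng.uniform(size=1000) < conf**2   # overconfident: accuracy < confidence

equal_width = np.linspace(0.0, 1.0, 11)                 # ten equal-width bins
equal_mass = np.quantile(conf, np.linspace(0.0, 1.0, 11))  # ten equal-mass bins
print(ece(conf, correct, equal_width), ece(conf, correct, equal_mass))
```

Equal-width binning leaves the low-confidence bins empty here, while quantile binning spreads samples evenly, so the two definitions weight the miscalibrated regions differently and report different values.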