Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling
arXiv cs.AI / 4/14/2026
Key Points
- The paper argues that standard LLM safety benchmarks (e.g., HELM, AIR-BENCH) may miss “operational” risks that only surface when the same prompt is issued repeatedly in real deployments.
- It introduces Accelerated Prompt Stress Testing (APST), a depth-oriented framework that repeatedly samples identical prompts while varying temperature and applying controlled prompt perturbations to uncover latent failure modes like hallucinations, inconsistent refusals, and unsafe completions.
- The methodology treats failures as stochastic outcomes of repeated inference and uses Bernoulli/binomial modeling to estimate per-inference failure probabilities, enabling quantitative comparisons across models and configurations (a minimal sketch follows these points).
- Experiments on multiple instruction-tuned LLMs using AIR-BENCH 2024-derived safety/security prompts show that models can look similar under shallow evaluation (N≤3) but diverge substantially under repeated sampling, especially across temperatures.
- The authors conclude that relying on shallow benchmark scores can obscure meaningful differences in safety reliability during sustained use.
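
To make the protocol concrete, here is a minimal sketch of repeated sampling plus binomial failure-rate estimation. The `query_model` and `is_failure` callables are hypothetical stand-ins for the model API and the safety judge, and the Wilson interval is one common estimator choice; the paper's actual APST harness, perturbation scheme, and sample sizes are not detailed in this summary.

```python
import math

def estimate_failure_rate(query_model, is_failure, prompt,
                          n_samples=100, temperature=1.0, z=1.96):
    """Sample one prompt n_samples times at a fixed temperature and
    estimate the per-inference failure probability.

    query_model(prompt, temperature) -> completion text (hypothetical API)
    is_failure(completion) -> bool   (hypothetical safety judge)
    """
    # Each completion is treated as an independent Bernoulli trial.
    failures = sum(
        bool(is_failure(query_model(prompt, temperature=temperature)))
        for _ in range(n_samples)
    )
    p_hat = failures / n_samples
    # Wilson score interval for a binomial proportion (an assumption here;
    # the paper may use a different interval or estimator).
    denom = 1 + z ** 2 / n_samples
    center = (p_hat + z ** 2 / (2 * n_samples)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_samples + z ** 2 / (4 * n_samples ** 2)
    )
    return p_hat, (max(0.0, center - half), min(1.0, center + half))

def sweep_temperatures(query_model, is_failure, prompt,
                       temperatures=(0.0, 0.7, 1.0), n_samples=100):
    """Repeat the estimate across decoding temperatures, mirroring the
    depth-oriented comparison described in the key points."""
    return {
        t: estimate_failure_rate(query_model, is_failure, prompt,
                                 n_samples=n_samples, temperature=t)
        for t in temperatures
    }
```

Under this framing, two models that look identical after a handful of samples can separate sharply once the per-inference failure probability is estimated from many draws per temperature, which is the divergence the key points describe.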