Brittlebench: Quantifying LLM robustness via prompt sensitivity
arXiv cs.LG / 3/17/2026
Key Points
- It presents Brittlebench, a theoretical framework for quantifying LLM sensitivity to prompt variants, decoupling dataset difficulty from prompt-induced variability.
- It applies semantics-preserving perturbations to popular benchmarks, showing that performance can drop by up to 12% and that a single perturbation changes model rankings in 63% of cases.
- Its variance decomposition shows that semantics-preserving prompt changes can account for up to half of a model's performance variance, underscoring the limits of current evaluation practice.
- Brittlebench provides a new evaluation pipeline for studying model brittleness and guiding more robust model development.
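The variance decomposition in the key points can be illustrated with a simple ANOVA-style split: given per-item scores under several semantics-preserving prompt variants, separate the variance driven by item difficulty from the variance driven by prompt wording alone. This is a hypothetical sketch, not the paper's actual pipeline; the function name and the toy scores are illustrative.

```python
# Hypothetical sketch of a Brittlebench-style variance split (not the
# paper's actual method): scores[i][j] = model score on item i under
# prompt variant j, all variants semantics-preserving.

def decompose_variance(scores):
    """Split total score variance into item-driven and prompt-driven parts."""
    n_items = len(scores)
    n_variants = len(scores[0])
    flat = [s for row in scores for s in row]
    grand = sum(flat) / len(flat)

    # Between-item variance: how much item difficulty moves the mean score.
    item_means = [sum(row) / n_variants for row in scores]
    var_items = sum((m - grand) ** 2 for m in item_means) / n_items

    # Between-variant variance: how much prompt wording alone moves it.
    variant_means = [
        sum(scores[i][j] for i in range(n_items)) / n_items
        for j in range(n_variants)
    ]
    var_prompts = sum((m - grand) ** 2 for m in variant_means) / n_variants

    total = sum((s - grand) ** 2 for s in flat) / len(flat)
    return {"items": var_items, "prompts": var_prompts, "total": total}
```

In this toy setting, a model whose scores flip with the prompt variant but not with the item would show all of its variance in the `prompts` component, which is the failure mode the paper's "up to half of performance variance" finding points at.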