Brittlebench: Quantifying LLM robustness via prompt sensitivity
arXiv cs.LG · March 17, 2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- Presents Brittlebench, a theoretical framework for quantifying LLM sensitivity to prompt variants, decoupling dataset difficulty from prompt-induced variability.
- Applies semantics-preserving perturbations to popular benchmarks: performance drops by up to 12%, and a single perturbation changes model rankings in 63% of cases.
- A variance decomposition shows that semantics-preserving prompt changes can account for up to half of a model's performance variance, underscoring the limits of current evaluation practice.
- Brittlebench provides an evaluation pipeline for studying model brittleness and guiding the development of more robust models.
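The core idea above can be sketched in a few lines: evaluate the same item under several semantics-preserving rewrites, then treat the per-item score spread as prompt-induced variance (the item's difficulty is held fixed, so any spread comes from wording alone). This is a minimal illustration, not the paper's pipeline; `model_answer`, the variant list, and the scoring rule are all hypothetical stand-ins.

```python
# Minimal sketch of a Brittlebench-style robustness probe.
# `model_answer` is a hypothetical stub standing in for any LLM call;
# the paper's actual perturbation set and scoring may differ.
from statistics import mean, pvariance

# Semantics-preserving rewrites of the same question (illustrative only).
PROMPT_VARIANTS = [
    "What is 2 + 2?",
    "Compute the sum of 2 and 2.",
    "2 + 2 = ?",
]

def model_answer(prompt: str) -> str:
    """Stub standing in for an LLM call; always answers '4' here."""
    return "4"

def score(prompt: str, gold: str) -> float:
    """1.0 if the model's answer matches the gold label, else 0.0."""
    return 1.0 if model_answer(prompt).strip() == gold else 0.0

def prompt_sensitivity(variants: list[str], gold: str) -> tuple[float, float]:
    """Return (mean accuracy, score variance) across variants of one item.

    Nonzero variance flags brittleness: the underlying item is constant,
    so the spread is attributable to prompt wording alone.
    """
    scores = [score(v, gold) for v in variants]
    return mean(scores), pvariance(scores)

acc, var = prompt_sensitivity(PROMPT_VARIANTS, gold="4")
print(f"mean accuracy: {acc:.2f}, prompt-induced variance: {var:.2f}")
```

Averaging these per-item variances over a benchmark, and comparing them to the between-item variance, gives the kind of decomposition the key points describe.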