CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models
arXiv cs.AI / 3/12/2026
Key Points
- The CEI Benchmark is introduced as a dataset of 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances.
- Each scenario pairs situational context with speaker/listener roles and an explicit power relation, covering five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection), three power configurations, and four settings (workplace, family, social, and service); a hypothetical record sketch follows this list.
- Three trained annotators independently labeled every scenario; inter-annotator agreement is low (Fleiss' kappa 0.06–0.25 by subtype), but the authors argue that this disagreement is itself informative and back the dataset with a four-level quality-control pipeline. A worked kappa example appears after this list.
- CEI is released under CC-BY-4.0 to serve as a standardized benchmark for pragmatic inference in language models.
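The summary does not show the released data format, but a minimal, hypothetical record illustrating the fields the bullets name might look like the following (all field names and values are illustrative, not the actual CEI schema):

```python
# Hypothetical CEI-style record; field names and values are illustrative,
# not the released schema.
scenario = {
    "context": "After the quarterly review, a manager stops by Sam's desk.",
    "utterance": "Wow, bold choice presenting those numbers without a forecast.",
    "speaker_role": "manager",
    "listener_role": "direct_report",
    "power_relation": "speaker_higher",  # one of the three power configurations
    "setting": "workplace",              # workplace | family | social | service
    "subtype": "sarcasm_irony",          # one of the five pragmatic subtypes
    "label": "sarcasm",                  # one annotator's pragmatic reading
}
```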
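Fleiss' kappa measures how far agreement among a fixed number of raters exceeds chance: the mean per-item agreement P_bar is compared with the chance agreement P_e implied by the marginal category proportions, giving kappa = (P_bar − P_e) / (1 − P_e). A minimal sketch for CEI's three-annotator setup, with invented toy labels rather than CEI data:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number
    of annotators (three in CEI). `ratings` is a list of label lists."""
    categories = sorted({label for item in ratings for label in item})
    n = len(ratings[0])  # raters per item
    N = len(ratings)     # number of items
    # counts[i][j]: how many raters put item i into category j
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # mean per-item agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # chance agreement P_e from marginal category proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 scenarios, 3 annotators each (labels invented for illustration)
print(fleiss_kappa([
    ["sarcasm", "sarcasm", "deflection"],
    ["politeness", "sarcasm", "politeness"],
    ["mixed_signals", "mixed_signals", "mixed_signals"],
    ["deflection", "passive_aggression", "sarcasm"],
]))  # ≈ 0.24: agreement only modestly above chance
```

Values in the paper's reported 0.06–0.25 range similarly indicate agreement only slightly to moderately above chance, which is what motivates the authors' treatment of disagreement as signal.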