CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models
arXiv cs.AI / 3/12/2026
Key Points
- The CEI Benchmark is introduced as a dataset of 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances.
- Each scenario pairs situational context and speaker/listener roles with explicit power relations, covering five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) and three power configurations across workplace, family, social, and service settings; a sketch of what one record might look like follows this list.
- Three trained annotators independently labeled every scenario. The authors report low inter-annotator agreement (Fleiss' kappa 0.06–0.25 by subtype) but argue that the disagreement is itself informative, and they back the dataset with a four-level quality-control pipeline; a worked kappa example appears after this list.
- CEI is released under CC-BY-4.0 to serve as a standardized benchmark for pragmatic inference in language models.
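
To make the scenario structure concrete, here is a minimal sketch of what a single record might look like. The field names and values are illustrative guesses assembled from the summary above, not the paper's actual schema.

```python
# Hypothetical CEI scenario record; field names and values are
# illustrative only and do not reflect the paper's actual schema.
scenario = {
    "id": "cei-0042",
    "subtype": "sarcasm_irony",          # one of the five pragmatic subtypes
    "setting": "workplace",              # workplace / family / social / service
    "power_relation": "speaker_higher",  # one of the three power configurations
    "context": "A manager reviews a report the intern submitted two days late.",
    "speaker": "manager",
    "listener": "intern",
    "utterance": "Wow, right on time as always.",
}
```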
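The reported Fleiss' kappa range (0.06–0.25) sits well below conventional thresholds for even "fair" agreement, which is why the authors' framing of disagreement as signal matters. As a minimal sketch of how such a figure is computed, the following self-contained Python function implements the standard Fleiss' kappa formula; the toy labels and category names are invented for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item rating lists.

    ratings: list of lists; ratings[i] holds the category label each
    of the n annotators assigned to item i (same n for every item).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})

    # n_ij counts: how many raters assigned item i to category j
    counts = [[Counter(item)[c] for c in categories] for item in ratings]

    # Per-item observed agreement P_i, then the mean P_bar
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement P_e from the marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(categories))]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 annotators label 4 scenarios with one of two intents
toy = [
    ["sarcasm", "sarcasm", "sincere"],
    ["sincere", "sincere", "sincere"],
    ["sarcasm", "sincere", "sincere"],
    ["sarcasm", "sarcasm", "sarcasm"],
]
print(f"Fleiss' kappa: {fleiss_kappa(toy):.3f}")
```

On the toy labels above this yields kappa of roughly 0.33; values in the paper's 0.06–0.25 range indicate agreement only slightly above chance.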