DeonticBench: A Benchmark for Reasoning over Rules
arXiv cs.CL / 4/7/2026
Key Points
- The paper introduces DeonticBench, a new benchmark for deontic reasoning in LLMs: reasoning about obligations, permissions, and prohibitions derived from explicit rules in long-context, high-stakes domains.
- DeonticBench contains 6,232 tasks spanning U.S. federal taxes, airline baggage policies, U.S. immigration administration, and state housing law, with options for both natural-language reasoning and solver-assisted workflows.
- The benchmark supports an optional symbolic pipeline in which models translate statutes and case facts into executable Prolog, yielding formal interpretations and an explicit program trace; reference Prolog programs are released for all instances (see the sketch after this list).
- Results show that even frontier LLMs and coding models top out at roughly 44.4% on the hard subset (SARA Numeric) and 46.6 macro-F1 (Housing), leaving substantial room for improvement in rule-grounded reasoning.
- The authors study supervised fine-tuning and reinforcement learning for symbolic program generation, finding that training improves Prolog generation quality but current RL approaches still do not solve tasks reliably.
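To make the solver-assisted workflow concrete, here is a minimal Prolog sketch of what a translated rule and case might look like. It is not taken from the paper's released programs: the predicate names (permitted/2, prohibited/2, owns/2, bag_weight/2) and the 7 kg carry-on limit are invented for illustration.

```prolog
% Hypothetical translation of a baggage-policy rule into Prolog.
% All predicate names and the weight threshold are illustrative,
% not from the DeonticBench reference programs.

% Rule: a passenger is permitted to carry on a bag weighing at most 7 kg.
permitted(Passenger, carry_on(Bag)) :-
    owns(Passenger, Bag),
    bag_weight(Bag, Kg),
    Kg =< 7.

% Rule: carrying on any other owned bag is prohibited.
prohibited(Passenger, carry_on(Bag)) :-
    owns(Passenger, Bag),
    \+ permitted(Passenger, carry_on(Bag)).

% Case facts extracted from a scenario description.
owns(alice, bag1).
bag_weight(bag1, 9).

% Query: ?- prohibited(alice, carry_on(bag1)).
% Succeeds, since 9 kg exceeds the 7 kg limit.
```

Executing such a query against a Prolog engine produces an explicit derivation, which is what makes the solver-assisted route auditable in a way that free-form natural-language reasoning is not.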