ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts
arXiv cs.AI / 5/4/2026
Key Points
- The paper proposes ARMOR 2025, a large-language-model safety benchmark designed for defense use cases, going beyond generic civilian-context social-risk testing.
- ARMOR 2025 is grounded in three military doctrine sources—the Law of War, Rules of Engagement, and Joint Ethics Regulation—and uses doctrinal text to create meaning-preserving multiple-choice questions.
- The benchmark is organized around an OODA-loop (Observe–Orient–Decide–Act) taxonomy, systematically testing both answer accuracy and refusal behavior across military-relevant decision types.
- In evaluations against 21 commercial LLMs, the authors found significant gaps in safety alignment for military decision-support scenarios.
- The benchmark includes a structured 12-category taxonomy, 519 prompts, and rigorous evaluation procedures to enable more realistic assessment of legal/ethical compliance.