Measuring and Eliminating Refusals in Military Large Language Models
arXiv cs.AI / 3/12/2026
Opinion | Ideas & Deep Analysis | Models & Research
Key Points
- The paper presents a gold benchmark for measuring refusal rates in military LLMs, developed by US Army veterans and described by the authors as the first dataset of its kind.
- It reports hard rejection rates as high as 98.2% and soft deflection rates ranging from 0% to 21.3% across 31 public models and 3 military models.
- It also evaluates two additional synthetic datasets and reports how their results correlate with those on the gold dataset.
- An ablation using the Heretic library on a military-tuned gpt-oss-20b model yields a 66.5-point absolute increase in answer rate, alongside a 2% average relative decrease on other military tasks, underscoring the trade-off between removing refusals and preserving task accuracy.
- In their concluding remarks, the authors call for deeper specialization, including mid-training and end-to-end post-training, to achieve zero refusals and maximum military task accuracy for closed military models.
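
The key points mix two kinds of figures: per-category rates (hard rejection, soft deflection, answer) and two different ways of reporting change (absolute percentage points versus relative percent). The sketch below shows one plausible way to compute each. It is not the authors' evaluation code: the label names, the helper functions, and all numbers are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical label set; the paper's actual annotation scheme may differ.
LABELS = ("answer", "hard_rejection", "soft_deflection")


def outcome_rates(labels):
    """Return the percentage of responses that fall under each label."""
    if not labels:
        raise ValueError("need at least one labeled response")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return {name: 100.0 * counts[name] / len(labels) for name in LABELS}


def absolute_change(before, after):
    """Change in percentage points (after minus before)."""
    return after - before


def relative_change(before, after):
    """Change expressed as a percentage of the starting value."""
    return 100.0 * (after - before) / before


# Toy data, not taken from the paper: 10 labeled responses from one model.
toy_labels = ["answer"] * 2 + ["hard_rejection"] * 7 + ["soft_deflection"]
rates = outcome_rates(toy_labels)

# Toy before/after scores, again purely illustrative.
answer_rate_gain = absolute_change(25.0, 75.0)  # gain in percentage points
task_score_drop = relative_change(50.0, 49.0)   # change in percent of baseline
```

The distinction matters when reading the summary: an absolute change is a difference in percentage points, while a relative change scales with the baseline, so in the toy numbers above a 2% relative decrease from a score of 50 amounts to only one point.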