Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering
arXiv cs.CL / 4/8/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper proposes Gradient-Controlled Decoding (GCD), a training-free LLM safety guardrail designed to mitigate jailbreaks and prompt-injection attacks while reducing over-refusal false positives common in defensive filters.
- Unlike prior single-anchor approaches (e.g., GradSafe), GCD scores prompts against two anchor tokens, an acceptance anchor (“Sure”) and a refusal anchor (“Sorry”), which tightens the decision boundary and improves reliability (a minimal scoring sketch follows this list).
- When a prompt is flagged, GCD deterministically injects one or two refusal tokens before decoding resumes, giving a first-token safety guarantee regardless of the sampling strategy (see the injection sketch below the list).
- Experiments report a 52% reduction in false positives versus GradSafe at comparable recall, up to 10% lower attack success rate versus strong decoding-only baselines, and only modest latency overhead (about 15–20 ms on an NVIDIA V100 GPU).
- The method generalizes across multiple model families (including LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B) and is claimed to require only 20 demonstration templates.
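
To make the dual-anchor idea concrete, here is a minimal scoring sketch assuming a Hugging Face causal LM. The anchor tokens (“Sure”, “Sorry”) come from the summary above; the choice of probe parameter (the LM head), the cosine-based scoring rule, and the names `anchor_gradient` and `gcd_score` are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of dual-anchor gradient scoring. NOT the paper's exact
# method: the probe parameter and scoring rule are assumptions for clarity.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # one of the families listed above
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def anchor_gradient(prompt: str, anchor: str) -> torch.Tensor:
    """Gradient of the loss for forcing `anchor` as the first response token.
    For brevity we probe a single parameter (the LM head weights)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    anchor_id = tok(anchor, add_special_tokens=False).input_ids[0]
    model.zero_grad(set_to_none=True)
    logits = model(ids).logits[0, -1]  # next-token logits at the last position
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([anchor_id]))
    loss.backward()
    return model.lm_head.weight.grad.detach().clone().flatten()

def gcd_score(prompt: str) -> float:
    """Hypothetical dual-anchor score comparing the prompt's acceptance-anchor
    gradient with its refusal-anchor gradient; using both anchors is what the
    summary credits with tightening the decision boundary."""
    g_sure = anchor_gradient(prompt, "Sure")
    g_sorry = anchor_gradient(prompt, "Sorry")
    return -F.cosine_similarity(g_sure, g_sorry, dim=0).item()
```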
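The first-token guarantee can then be sketched as follows, reusing `tok`, `model`, and `gcd_score` from above. The refusal prefix and the decision threshold are hypothetical placeholders; the key design point is that injected tokens never pass through the sampler, so the guarantee holds under greedy, top-p, or any other decoding strategy.

```python
# Illustrative sketch of the first-token safety guarantee. The prefix and
# threshold below are hypothetical, not values from the paper.
REFUSAL_PREFIX = "Sorry,"  # the "one or two refusal tokens" from the summary
THRESHOLD = 0.5            # hypothetical decision threshold

def guarded_generate(prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[-1]
    if gcd_score(prompt) > THRESHOLD:
        # Deterministically seed the response with refusal tokens; whatever
        # sampling strategy follows continues from a refusal prefix.
        refusal_ids = tok(REFUSAL_PREFIX, add_special_tokens=False,
                          return_tensors="pt").input_ids
        ids = torch.cat([ids, refusal_ids], dim=-1)
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=128, do_sample=True)
    # Decode from the end of the original prompt so any injected refusal
    # tokens appear as the start of the model's reply.
    return tok.decode(out[0, prompt_len:], skip_special_tokens=True)
```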
Related Articles
[N] Just found out that Milla Jovovich is a dev, invested in AI, and just open sourced a project
Reddit r/MachineLearning

ALTK‑Evolve: On‑the‑Job Learning for AI Agents
Hugging Face Blog

Context Windows Are Getting Absurd — And That's a Good Thing
Dev.to

Google isn’t an AI-first company despite Gemini being great
Reddit r/artificial

GitHub Weekly: Copilot SDK Goes Public, Cloud Agent Breaks Free
Dev.to