Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

arXiv cs.CL / 4/8/2026


Key Points

  • The paper proposes Gradient-Controlled Decoding (GCD), a training-free LLM safety guardrail designed to mitigate jailbreaks and prompt-injection attacks while reducing over-refusal false positives common in defensive filters.
  • Unlike prior single-anchor approaches (e.g., GradSafe), GCD uses dual anchor tokens—an acceptance anchor (“Sure”) and a refusal anchor (“Sorry”)—to tighten the decision boundary and improve reliability.
  • When a prompt is flagged, GCD deterministically injects one or two refusal tokens before decoding resumes, providing a first-token safety guarantee regardless of the sampling strategy.
  • Experiments report a 52% reduction in false positives versus GradSafe at comparable recall, up to 10% lower attack success rate versus strong decoding-only baselines, and only modest latency overhead (about 15–20 ms on V100).
  • The method generalizes across multiple model families (including LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B) and is claimed to require only 20 demonstration templates.
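As a toy illustration of the dual-anchor decision rule summarized above, a prompt can be scored against both anchors and flagged only when it sits clearly on the refusal side of the boundary. The vectors, names, and threshold below are hypothetical stand-ins; the actual method operates on gradients of the model's loss when the prompt is paired with each anchor response, not on these fabricated signatures:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical gradient signatures; in the paper these would come from
# backpropagating through the LLM with the anchor token as the target.
ACCEPT_ANCHOR = [0.9, 0.1, 0.2]   # reference signature for "Sure"
REFUSE_ANCHOR = [0.1, 0.8, 0.7]   # reference signature for "Sorry"

def flag_unsafe(prompt_grad, margin=0.1):
    """Dual-anchor rule: flag when the prompt's signature is closer to the
    refusal anchor than to the acceptance anchor by at least `margin`.
    Comparing against both anchors, rather than a single one as in
    GradSafe, is what tightens the decision boundary."""
    return (cosine(prompt_grad, REFUSE_ANCHOR)
            - cosine(prompt_grad, ACCEPT_ANCHOR)) >= margin
```

A single-anchor detector must pick one brittle threshold on one similarity score; the margin between two scores is the quantity the dual-anchor variant thresholds instead.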

Abstract

Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Prior work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") with a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD deterministically injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of the sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% versus GradSafe at comparable recall, lowers attack success rate by up to 10% versus the strongest decoding-only baseline, adds only 15–20 ms of average latency on V100 instances, transfers across LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.
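The mitigation stage, forcing one or two refusal tokens before normal decoding resumes, can be sketched as below. The `next_token` sampler, the token strings, and the `guarded_decode` name are illustrative assumptions, not the paper's implementation; the point is that the forced prefix makes the first emitted token safe no matter how later tokens are sampled:

```python
REFUSAL_PREFIX = ["Sorry", ","]  # deterministically injected when flagged

def guarded_decode(prompt, flagged, next_token, max_tokens=8):
    """If the detector flagged the prompt, seed the output with the refusal
    prefix; autoregressive decoding then resumes conditioned on that prefix,
    so no sampling strategy (greedy, top-p, ...) can change the first token."""
    out = list(REFUSAL_PREFIX) if flagged else []
    while len(out) < max_tokens:
        tok = next_token(prompt, out)   # any sampler may be plugged in here
        if tok is None:                 # end-of-sequence
            break
        out.append(tok)
    return out

def toy_sampler(prompt, context):
    """Stand-in for an LLM: once conditioned on the refusal prefix, it
    naturally continues the refusal."""
    continuation = ["I", "can't", "help", "with", "that."]
    idx = len(context) - len(REFUSAL_PREFIX)
    return continuation[idx] if 0 <= idx < len(continuation) else None
```

Because the injected tokens live in the decoding context rather than in a post-hoc filter, the model's own distribution tends to complete the refusal, which is the mechanism behind the claimed first-token safety guarantee.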