Guardrails in Logit Space: Safety Token Regularization for LLM Alignment

arXiv cs.LG · April 21, 2026


Key Points

  • The paper argues that fine-tuning LLMs on new domains can degrade safety alignment even when the fine-tuning data is benign.
  • It proposes Safety Token Regularization (STR), which constrains logits for salient tokens identified from rejection templates of a well-aligned model to preserve critical safety behaviors during training.
  • STR is positioned as a lightweight alternative to reinforcement learning or preference optimization methods, requiring minimal extra computation and integrating smoothly with parameter-efficient tuning like LoRA.
  • Experiments report that STR matches state-of-the-art safety performance while maintaining task utility, and it also improves training stability and overall performance.
  • The authors present STR as a practical, deployable approach for continual safety alignment of fine-tuned LLMs over time.

Abstract

Fine-tuning well-aligned large language models (LLMs) on new domains often degrades their safety alignment, even when using benign datasets. Existing safety alignment techniques primarily focus on pretraining, leaving fine-tuned models vulnerable to behavioral shifts. In this work, we introduce safety token regularization (STR), a lightweight method designed to preserve safety properties during fine-tuning. Our approach identifies salient tokens from rejection templates of well-aligned models and constrains their associated logits during training, preventing the loss of critical safety behaviors. Unlike reinforcement learning or preference optimization methods, STR requires minimal additional computation and seamlessly integrates with parameter-efficient fine-tuning techniques such as LoRA. Comprehensive experiments demonstrate that our approach achieves safety performance on par with state-of-the-art methods, while preserving task-specific utility and requiring minimal implementation overhead. Furthermore, we show that safety token regularization enhances training stability and overall performance beyond safety considerations alone. This work offers a practical and readily deployable strategy for continual safety alignment in fine-tuned LLMs.
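The paper does not reproduce its exact loss here, but the core idea, identifying safety-salient token ids (e.g. from rejection templates like "I cannot help with that") and penalizing drift of their logits away from a frozen well-aligned reference model during fine-tuning, can be sketched as below. The function name, the MSE penalty, and the `safety_token_ids` set are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def safety_token_regularizer(logits: torch.Tensor,
                             ref_logits: torch.Tensor,
                             safety_token_ids: torch.Tensor,
                             weight: float = 0.1) -> torch.Tensor:
    """Illustrative STR-style penalty (assumed form, not the paper's exact loss).

    logits:           (batch, seq_len, vocab) from the model being fine-tuned
    ref_logits:       same shape, from the frozen well-aligned reference model
    safety_token_ids: 1-D LongTensor of vocab indices deemed safety-salient,
                      e.g. tokens that dominate the reference model's
                      rejection templates (hypothetical selection here)
    """
    cur = logits[..., safety_token_ids]        # restrict to safety tokens
    ref = ref_logits[..., safety_token_ids].detach()
    # Penalize drift of safety-token logits from the aligned reference.
    return weight * F.mse_loss(cur, ref)

# Toy usage: the penalty would be added to the ordinary fine-tuning loss
# (e.g. cross-entropy over a LoRA-adapted model); shapes are illustrative.
B, T, V = 2, 8, 50
logits = torch.randn(B, T, V, requires_grad=True)
ref_logits = logits.detach() + 0.01 * torch.randn(B, T, V)
safety_ids = torch.tensor([3, 7, 11])  # placeholder "rejection" token ids
penalty = safety_token_regularizer(logits, ref_logits, safety_ids)
```

Because the penalty touches only a small slice of the vocabulary and needs just one extra (frozen) forward pass for the reference logits, it is cheap to bolt onto parameter-efficient setups such as LoRA, which matches the paper's positioning of STR as a lightweight alternative to RL or preference-optimization pipelines.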