Enhancing Safety of Large Language Models via Embedding Space Separation
arXiv cs.AI / 2026/3/24
Key Points
- The paper builds on the observation that embeddings of harmful and safe queries are often linearly separable — a property that has previously enabled attacks which shift harmful representations toward the safe region to bypass refusals.
- It introduces a fine-tuning method called Embedding Space Separation (ES2) that improves safety by explicitly increasing the distance between harmful and safe representations in the embedding space.
- To avoid harming the model’s overall abilities, the method adds a KL-divergence regularization term that keeps the fine-tuned model’s logits aligned with the base model on harmless inputs.
- Experiments on multiple open-source LLMs using standard safety benchmarks show substantial safety improvements while preserving general capabilities.
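The training objective described above — a term that pushes harmful and safe embeddings apart, plus a KL regularizer tying the fine-tuned model's logits to the base model on harmless inputs — can be sketched roughly as follows. The specific loss form (a hinge on centroid distance), the KL direction, and the weights `alpha`/`beta` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def separation_loss(harmful_emb, safe_emb, margin=1.0):
    """Hinge-style stand-in for the separation objective: penalize the
    model when the centroids of harmful and safe query embeddings are
    closer than `margin` in embedding space."""
    dist = np.linalg.norm(harmful_emb.mean(axis=0) - safe_emb.mean(axis=0))
    return max(0.0, margin - dist)

def kl_regularizer(logits_ft, logits_base):
    """KL(p_base || p_ft) over the vocabulary for one harmless input,
    discouraging the fine-tuned model from drifting away from the base
    model's predictions (the direction of the KL is an assumption)."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p, q = softmax(logits_base), softmax(logits_ft)
    return float(np.sum(p * np.log(p / q)))

def es2_objective(harmful_emb, safe_emb, logits_ft, logits_base,
                  alpha=1.0, beta=0.1):
    """Hypothetical combined loss: separation term plus KL penalty;
    alpha and beta are illustrative trade-off weights."""
    return (alpha * separation_loss(harmful_emb, safe_emb)
            + beta * kl_regularizer(logits_ft, logits_base))
```

In an actual fine-tuning run these terms would be computed on batches of harmful/safe queries and harmless inputs respectively, and minimized alongside (or in place of) the standard language-modeling loss.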

