Multilingual Safety Alignment via Self-Distillation

arXiv cs.LG / 5/6/2026


Key Points

  • The paper addresses a key problem in LLMs: safety alignment is strong in high-resource languages, yet models remain vulnerable to jailbreaks in low-resource languages.
  • It introduces Multilingual Self-Distillation (MSD), a cross-lingual safeguard transfer framework that moves safety capabilities from high-resource languages (e.g., English) to low-resource ones (e.g., Javanese) without requiring high-quality response data per language.
  • The authors propose two implementations—on-policy MSD and off-policy MSD—that both perform cross-lingual safety transfer using only multilingual queries.
  • They add Dual-Perspective Safety Weighting (DPSW), which uses a divergence measure to adjust training penalties by emphasizing safety-critical tokens and down-weighting non-critical ones from both teacher and student perspectives.
  • Experiments on multiple LLMs and multilingual jailbreak/utility benchmarks show MSD achieves consistently better multilingual safety, generalizes to harder datasets and unseen languages, and largely preserves the models’ general capabilities.
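The core idea above — distill the model's own English-side safe behavior into its responses to the same query in a low-resource language — can be illustrated with a minimal, self-contained sketch. All names and the exact loss form here are illustrative assumptions, not the paper's implementation: the "teacher" distributions come from the model prompted with the high-resource query, the "student" distributions from the same model prompted with the translated query, and training minimizes a per-token KL divergence.

```python
import math

def kl(p, q):
    """KL(p || q) between two discrete next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def msd_loss(teacher_dists, student_dists, weights=None):
    """Illustrative token-level distillation objective (hypothetical form).

    teacher_dists: per-token distributions from the model given the
        high-resource (e.g. English) query.
    student_dists: per-token distributions from the same model given the
        low-resource-language query.
    weights: optional per-token safety weights (e.g. produced by DPSW).
    """
    if weights is None:
        weights = [1.0] * len(teacher_dists)
    terms = [w * kl(t, s) for w, t, s in zip(weights, teacher_dists, student_dists)]
    return sum(terms) / len(terms)

# Toy example: two token positions over a 3-word vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]
print(round(msd_loss(teacher, student), 4))
```

Under this framing, "on-policy" vs. "off-policy" MSD would differ in whose outputs the student distributions are scored on (the student's own samples vs. pre-generated teacher responses); only multilingual queries are needed, never per-language gold responses.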

Abstract

Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM's inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods -- on-policy MSD and off-policy MSD -- both of which enable effective cross-lingual safety transfer using only multilingual queries. Furthermore, we propose Dual-Perspective Safety Weighting (DPSW), a divergence measure to optimize the distillation objective. By jointly considering the perspectives of both the teacher and the student, DPSW adaptively increases the penalty weights on safety-critical tokens while reducing the weights on non-critical tokens. Extensive experiments on representative LLMs across diverse multilingual jailbreak and utility benchmarks demonstrate that our method consistently achieves superior multilingual safety performance. Notably, it generalizes effectively to more challenging datasets and unseen languages while preserving the model's general capabilities.
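The abstract describes DPSW only at a high level: it jointly uses the teacher's and the student's perspectives to up-weight safety-critical tokens and down-weight non-critical ones. One plausible (entirely hypothetical) reading is to weight each token position by disagreement measured from both directions, i.e. forward KL (teacher's view) plus reverse KL (student's view):

```python
import math

def token_weight(t, s, eps=1e-8):
    """Hypothetical dual-perspective weight for one token position.

    Combines forward KL (teacher's view of the mismatch) and reverse KL
    (student's view): positions where the two distributions disagree most
    get the largest penalty weight. This is a sketch of DPSW's stated
    intent, not the paper's actual formula.
    """
    fwd = sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(t, s))
    rev = sum(q * math.log((q + eps) / (p + eps)) for p, q in zip(t, s))
    return 1.0 + fwd + rev  # baseline weight 1, boosted by disagreement

teacher = [0.05, 0.05, 0.9]   # teacher strongly prefers a refusal token
student = [0.4, 0.4, 0.2]     # student is off-distribution here
print(round(token_weight(teacher, student), 3))
```

On tokens where teacher and student already agree (typically non-safety-critical continuations), both KL terms vanish and the weight stays at its baseline, matching the abstract's claim that non-critical tokens are down-weighted relative to safety-critical ones.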