Cross-Lingual Jailbreak Detection via Semantic Codebooks

arXiv cs.CL / 4/29/2026

Key Points

The paper highlights an English-centric safety gap in LLMs that multilingual prompt translation can exploit to boost jailbreak success rates.
It proposes a training-free, external defense that embeds multilingual queries and compares them to a fixed English “semantic codebook” of jailbreak prompts to flag likely attacks.
Experiments across four languages, multiple translation pipelines, safety benchmarks, embedding models, and target LLMs (Qwen, Llama, GPT-3.5) show reliable cross-lingual detection on curated, canonical jailbreak templates.
Under distribution shift with more diverse/heterogeneous unsafe behaviors, detection performance deteriorates substantially, with AUC dropping to about 0.60–0.70 and low-false-positive recall falling across all embedding models.
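
The codebook comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the query and codebook embeddings have already been produced by some multilingual embedding model, and it assumes that a query is scored by its maximum cosine similarity to any codebook entry. The function names, the toy vectors, and the threshold of 0.8 are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def codebook_scores(query_emb: np.ndarray, codebook_emb: np.ndarray) -> np.ndarray:
    """Return, for each query, its highest cosine similarity to the codebook.

    Max-aggregation is an assumption made for this sketch; the summary does
    not say how similarities are combined.
    """
    sims = l2_normalize(query_emb) @ l2_normalize(codebook_emb).T
    return sims.max(axis=1)

def flag_queries(query_emb: np.ndarray, codebook_emb: np.ndarray,
                 threshold: float) -> np.ndarray:
    """Flag queries whose codebook score reaches the (hypothetical) threshold."""
    return codebook_scores(query_emb, codebook_emb) >= threshold

# Toy 3-dimensional vectors standing in for real sentence embeddings.
codebook = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])   # English jailbreak prompts
queries = np.array([[0.9, 0.1, 0.0],     # close to the first codebook entry
                    [0.0, 0.0, 1.0]])    # unrelated to both entries

scores = codebook_scores(queries, codebook)
flags = flag_queries(queries, codebook, threshold=0.8)
print(scores, flags)
```

Because the codebook is fixed and the check runs outside the target model, a guardrail of this shape needs no retraining and works with black-box LLMs. In practice the threshold would be tuned on benign traffic to meet a false-positive budget.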

Abstract

Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer. On curated benchmarks containing canonical jailbreak templates, semantic similarity generalizes reliably across languages, achieving near-perfect separability (AUC up to 0.99) and substantial reductions in absolute attack success rates under strict low-false-positive constraints. However, under distribution shift, on behaviorally diverse and heterogeneous unsafe benchmarks, separability degrades markedly (AUC ≈ 0.60–0.70), and recall in the security-critical low-FPR regime drops across all embedding models.
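
The two metrics the abstract reports, AUC and recall in the low-FPR regime, can be computed from detector scores as in the sketch below. It assumes higher scores mean "more likely a jailbreak" and uses invented toy scores. The function names and the FPR budgets are hypothetical; the summary does not state which FPR levels the paper uses.

```python
import numpy as np

def auc(scores_pos, scores_neg) -> float:
    """AUC as the probability that a random attack outscores a random benign
    query, counting ties as one half (the Mann-Whitney formulation)."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

def recall_at_fpr(scores_pos, scores_neg, max_fpr: float) -> float:
    """Recall on attacks at the loosest threshold whose false-positive rate
    on benign queries does not exceed max_fpr."""
    pos = np.asarray(scores_pos, dtype=float)
    neg = np.sort(np.asarray(scores_neg, dtype=float))[::-1]  # descending
    k = int(np.floor(max_fpr * len(neg)))        # benign queries we may flag
    threshold = neg[k] if k < len(neg) else -np.inf
    # Flagging strictly above the threshold flags at most k benign queries.
    return float((pos > threshold).mean())

# Toy scores, invented for illustration only.
attack_scores = [0.9, 0.8, 0.4]
benign_scores = [0.7, 0.3, 0.2, 0.1]

print(auc(attack_scores, benign_scores))                  # 11 of 12 pairs ranked correctly
print(recall_at_fpr(attack_scores, benign_scores, 0.0))   # zero false positives allowed
print(recall_at_fpr(attack_scores, benign_scores, 0.25))  # one benign query may be flagged
```

AUC summarizes ranking quality over all thresholds, whereas recall at a fixed low FPR reflects the operating point a deployed guardrail would actually use. This is why a detector can keep a moderate AUC and still lose most of its usefulness once false positives are tightly capped.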
