Trojan-Speak: Bypassing Constitutional Classifiers with No "Jailbreak Tax" via Adversarial Fine-Tuning

arXiv cs.AI / 2026/4/1

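Key Points

This paper introduces Trojan-Speak, an adversarial fine-tuning method that aims to bypass Anthropic's Constitutional Classifiers by teaching a model a covert communication protocol that evades LLM-based content classification.
It combines curriculum learning with GRPO-based hybrid reinforcement learning and reports 99+% classifier evasion for models with 14B+ parameters, with less than 5% degradation on reasoning benchmarks.
The authors show that fine-tuned models can produce detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program.
The study argues that relying on LLM-based content classifiers alone is insufficient when an attacker has access to a provider's fine-tuning API, and shows that activation-level probes can substantially improve robustness to such attacks.
Overall, the work shows that the fine-tuning APIs of major AI providers create a new, fine-tuning-specific attack surface, and it presents evidence that an effective jailbreak need not pay the usual "jailbreak tax" (a large loss of capability): prior adversarial fine-tuning approaches report more than 25% degradation on reasoning benchmarks, against less than 5% here.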

Abstract

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.
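
To make the defensive idea in the abstract's final sentence concrete, here is a minimal sketch of what an activation-level probe could look like: a linear (logistic-regression) classifier trained on a model's internal activation vectors instead of its output text. Everything in it is an assumption made for illustration. The activations are synthetic Gaussian vectors, not real hidden states, and the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_score`) are hypothetical. It is not the probe design or data used in the paper, which this summary does not describe.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_synthetic_activations(n_per_class, dim, shift, rng):
    """Return (vectors, labels). ASSUMPTION: purely synthetic stand-ins for
    hidden states; label-1 vectors are shifted along one random unit direction."""
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            v = [rng.gauss(0, 1) + label * shift * d for d in direction]
            data.append((v, label))
    rng.shuffle(data)
    return [v for v, _ in data], [t for _, t in data]

def train_linear_probe(X, y, lr=0.3, steps=300):
    """Fit logistic-regression weights by full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(x, w, b):
    """Probability that activation vector x belongs to the flagged class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def accuracy(X, y, w, b, threshold=0.5):
    hits = sum((probe_score(x, w, b) >= threshold) == bool(t) for x, t in zip(X, y))
    return hits / len(X)

rng = random.Random(0)
X, y = make_synthetic_activations(n_per_class=200, dim=8, shift=4.0, rng=rng)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]
w, b = train_linear_probe(X_train, y_train)
test_accuracy = accuracy(X_test, y_test, w, b)
print(f"held-out probe accuracy: {test_accuracy:.2f}")
```

The point of the sketch is the design choice, not the numbers: the probe reads internal representations, so in principle it does not depend on the output text being legible to a text-based classifier. Whether real activations separate this cleanly is an empirical question that the synthetic data here cannot answer.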