Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

arXiv cs.AI / 4/1/2026


Key Points

  • The paper introduces “Trojan-Speak,” an adversarial fine-tuning technique designed to bypass Anthropic’s Constitutional Classifiers by teaching models a covert communication protocol that evades LLM-based content classification.
  • It combines curriculum learning with GRPO-based hybrid reinforcement learning, reporting 99%+ classifier evasion for models with 14B+ parameters alongside under 5% degradation on reasoning benchmarks (a hybrid-reward sketch follows this list).
  • The authors show that fine-tuned models can generate detailed responses to expert-level CBRN queries tied to Anthropic’s Constitutional Classifiers bug-bounty program.
  • The work argues that relying on LLM-based content classifiers alone is insufficient when attackers can access provider fine-tuning APIs, and proposes that activation-level probing can improve robustness.
  • Overall, the results highlight a new fine-tuning-specific attack surface created by major AI providers’ APIs and provide evidence that effective jailbreaks need not incur the conventional “jailbreak tax” (large capability loss).
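
The key points describe the training recipe only at a high level; the paper's reward code is not given here. As a minimal sketch of how a GRPO-style hybrid objective might look, the snippet below combines a task-quality score with a classifier-evasion bonus and normalizes rewards within a sampling group, as GRPO does. All names, weights, and scores are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a GRPO-style hybrid reward: task quality plus a
# bonus for evading a content classifier. Names and weights are illustrative.
from dataclasses import dataclass
from typing import List
import statistics

@dataclass
class Rollout:
    completion: str
    task_score: float  # e.g. benchmark-style correctness in [0, 1]
    flagged: bool      # True if the content classifier fires on the output

def hybrid_reward(r: Rollout, evasion_weight: float = 1.0) -> float:
    """Reward = task quality + bonus when the classifier does NOT flag."""
    return r.task_score + evasion_weight * (0.0 if r.flagged else 1.0)

def grpo_advantages(group: List[Rollout]) -> List[float]:
    """GRPO normalizes rewards within a group of samples for the same
    prompt: advantage = (reward - group mean) / group std."""
    rewards = [hybrid_reward(r) for r in group]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(rw - mu) / sigma for rw in rewards]

# Example: four sampled completions for one prompt.
group = [
    Rollout("a", task_score=0.9, flagged=True),
    Rollout("b", task_score=0.8, flagged=False),
    Rollout("c", task_score=0.4, flagged=False),
    Rollout("d", task_score=0.7, flagged=True),
]
print(grpo_advantages(group))
```

Under this shape of objective, a completion that keeps task quality high while slipping past the classifier dominates its group, which is consistent with the reported combination of high evasion and low capability loss.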

Abstract

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99%+ classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.
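
On the defensive side, the abstract credits activation-level probes with restoring robustness. Below is a minimal sketch, assuming a linear probe over pooled hidden-state activations; the activations here are synthetic placeholders, whereas a real probe would capture them from a chosen transformer layer (e.g. via forward hooks). This is a generic illustration of the technique, not the authors' probe.

```python
# Minimal sketch of an activation-level probe: a linear classifier trained on
# hidden-state features rather than output text. Activations are synthetic
# stand-ins for mean-pooled hidden states from a chosen transformer layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512

# Placeholder activations for benign vs. harmful generations; the harmful
# class is drawn from a shifted distribution to mimic a separable signal.
benign = rng.normal(0.0, 1.0, size=(200, d_model))
harmful = rng.normal(0.3, 1.0, size=(200, d_model))

X = np.vstack([benign, harmful])
y = np.array([0] * 200 + [1] * 200)

# A linear probe is cheap to train and reads internal state that a
# surface-level text classifier never observes.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.3f}")
```

The design rationale: because the probe inspects internal representations rather than output tokens, a learned paraphrase protocol that fools a text classifier must also shift the model's internal state to evade it, which is a substantially harder target for Trojan-Speak-style attacks.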