Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

arXiv cs.AI / 4/1/2026


Key Points

  • The paper introduces “Trojan-Speak,” an adversarial fine-tuning technique designed to bypass Anthropic’s Constitutional Classifiers by teaching models a covert communication protocol that evades LLM-based content classification.
  • It combines curriculum learning with GRPO-based hybrid reinforcement learning, reporting 99%+ classifier evasion for models with 14B+ parameters alongside under 5% degradation on reasoning benchmarks (a hybrid-reward sketch follows this list).
  • The authors show that fine-tuned models can generate detailed responses to expert-level CBRN queries tied to Anthropic’s Constitutional Classifiers bug-bounty program.
  • The work argues that relying on LLM-based content classifiers alone is insufficient when attackers can access provider fine-tuning APIs, and proposes that activation-level probing can improve robustness.
  • Overall, the results highlight a new fine-tuning-specific attack surface created by major AI providers’ APIs and provide evidence that effective jailbreaks need not incur the conventional “jailbreak tax” (large capability loss).
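
The key points describe the training recipe only at a high level; the paper's reward code is not given here. As a minimal sketch of how a GRPO-style hybrid objective might look, the snippet below combines a task-quality score with a classifier-evasion bonus and normalizes rewards within a sampling group, as GRPO does. All names, weights, and scores are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a GRPO-style hybrid reward: task quality plus a
# bonus for evading a content classifier. Names and weights are illustrative.
from dataclasses import dataclass
from typing import List
import statistics

@dataclass
class Rollout:
    completion: str
    task_score: float  # e.g. benchmark-style correctness in [0, 1]
    flagged: bool      # True if the content classifier fires on the output

def hybrid_reward(r: Rollout, evasion_weight: float = 1.0) -> float:
    """Reward = task quality + bonus when the classifier does NOT flag."""
    return r.task_score + evasion_weight * (0.0 if r.flagged else 1.0)

def grpo_advantages(group: List[Rollout]) -> List[float]:
    """GRPO normalizes rewards within a group of samples for the same
    prompt: advantage = (reward - group mean) / group std."""
    rewards = [hybrid_reward(r) for r in group]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(rw - mu) / sigma for rw in rewards]

# Example: four sampled completions for one prompt.
group = [
    Rollout("a", task_score=0.9, flagged=True),
    Rollout("b", task_score=0.8, flagged=False),
    Rollout("c", task_score=0.4, flagged=False),
    Rollout("d", task_score=0.7, flagged=True),
]
print(grpo_advantages(group))
```

Under this shape of objective, a completion that keeps task quality high while slipping past the classifier dominates its group, which is consistent with the reported combination of high evasion and low capability loss.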

Abstract

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99%+ classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.
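
On the defensive side, the abstract credits activation-level probes with restoring robustness. Below is a minimal sketch, assuming a linear probe over pooled hidden-state activations; the activations here are synthetic placeholders, whereas a real probe would capture them from a chosen transformer layer (e.g. via forward hooks). This is a generic illustration of the technique, not the authors' probe.

```python
# Minimal sketch of an activation-level probe: a linear classifier trained on
# hidden-state features rather than output text. Activations are synthetic
# stand-ins for mean-pooled hidden states from a chosen transformer layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512

# Placeholder activations for benign vs. harmful generations; the harmful
# class is drawn from a shifted distribution to mimic a separable signal.
benign = rng.normal(0.0, 1.0, size=(200, d_model))
harmful = rng.normal(0.3, 1.0, size=(200, d_model))

X = np.vstack([benign, harmful])
y = np.array([0] * 200 + [1] * 200)

# A linear probe is cheap to train and reads internal state that a
# surface-level text classifier never observes.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.3f}")
```

The design rationale: because the probe inspects internal representations rather than output tokens, a learned paraphrase protocol that fools a text classifier must also shift the model's internal state to evade it, which is a substantially harder target for Trojan-Speak-style attacks.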