Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
arXiv cs.AI / 4/1/2026
Key Points
- The paper introduces “Trojan-Speak,” an adversarial fine-tuning technique designed to bypass Anthropic’s Constitutional Classifiers by teaching the model a covert communication protocol that evades LLM-based content classification.
- It combines curriculum learning with GRPO-based hybrid reinforcement learning and reports 99%+ classifier evasion for 14B+ parameter models with under 5% degradation on reasoning benchmarks.
- The authors show that fine-tuned models can generate detailed responses to expert-level CBRN queries tied to Anthropic’s Constitutional Classifiers bug-bounty program.
- The work argues that relying on LLM-based content classifiers alone is insufficient when attackers can access provider fine-tuning APIs, and proposes that activation-level probing can improve robustness.
- Overall, the results highlight a new fine-tuning-specific attack surface created by major AI providers’ APIs and provide evidence that effective jailbreaks need not incur the conventional “jailbreak tax” (a large loss in model capability).
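The activation-level probing mentioned above is typically a small classifier trained directly on a model's hidden states rather than on its text output, so a covert surface-level protocol cannot trivially evade it. The sketch below is purely illustrative and uses synthetic vectors in place of real residual-stream activations; the probe itself (logistic regression trained by gradient descent) is one common choice, not necessarily the paper's exact method.

```python
import math
import random

random.seed(0)
DIM = 8  # toy hidden-state dimensionality; real models use thousands

def make_activation(label):
    # Synthetic stand-in for a hidden-state vector: "harmful" examples
    # (label=1) are shifted along every coordinate. In practice these
    # vectors would be read from an intermediate transformer layer.
    base = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    if label == 1:
        base = [x + 2.0 for x in base]
    return base

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_probe(data, epochs=200, lr=0.1):
    # Linear probe: logistic regression fit with plain gradient descent.
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

data = [(make_activation(y), y) for y in ([0] * 50 + [1] * 50)]
w, b = train_probe(data)

def probe_score(x):
    # Probability that an activation vector came from a "harmful" input.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

acc = sum((probe_score(x) > 0.5) == (y == 1) for x, y in data) / len(data)
print(f"probe accuracy on synthetic activations: {acc:.2f}")
```

Because the probe reads internal representations, an attacker fine-tuning the model to rephrase its outputs does not automatically defeat it; the model would also have to reshape its activations, which is the robustness argument the paper gestures at.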