Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
arXiv cs.LG / April 2, 2026
Key Points
- The paper introduces “ThoughtSteer,” a backdoor attack on continuous latent-reasoning language models that produces hijacked outputs while emitting no token-level trace of the manipulation.
- By perturbing a single input-layer embedding vector, the attacker leverages the model’s own multi-pass latent reasoning to amplify the change into a controlled latent trajectory that yields the attacker’s chosen answer (a minimal sketch of this amplification loop follows the list).
- Experiments across two model architectures (Coconut, SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters show ≥99% attack success with near-baseline clean accuracy, strong transfer to held-out benchmarks (94–100%), and evasion of the five evaluated active defenses.
- The work attributes the failure of token-level defenses to a latent-space phenomenon (“Neural Collapse”) that pulls representations toward a geometric attractor, and it argues that any effective backdoor must leave a linearly separable signature in latent space (probe AUC ≥ 0.999; see the probe sketch after this list).
- The authors highlight a mechanistic interpretability paradox: correct answer information can still be present in individual latent vectors even while the model outputs the wrong answer, suggesting the adversarial signal lies in the collective trajectory rather than any single embedding.
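
To make the amplification mechanism in the second point concrete, here is a minimal, hedged sketch in PyTorch. It assumes a Coconut-style loop in which the last hidden state of each forward pass is fed back as a continuous "thought" embedding; the function and argument names (`latent_reasoning_forward`, `n_latent_steps`, `trigger`) are illustrative, and the single-position perturbation is a plausible reconstruction of the attack setup, not the authors' released code.

```python
import torch

def latent_reasoning_forward(model, input_embeds, n_latent_steps, trigger=None):
    """Coconut-style continuous latent reasoning (illustrative sketch).

    `model` is assumed to be a Hugging Face-style base transformer that
    accepts `inputs_embeds` and returns `last_hidden_state`. No tokens
    are decoded between reasoning steps, so a trigger applied here never
    surfaces in any visible token sequence.
    """
    if trigger is not None:
        # Hypothetical backdoor: perturb a single input-layer embedding
        # position (here, position 0). The change is token-invisible.
        input_embeds = input_embeds.clone()
        input_embeds[:, 0, :] += trigger

    embeds = input_embeds
    for _ in range(n_latent_steps):
        hidden = model(inputs_embeds=embeds).last_hidden_state
        # The final hidden state becomes the next continuous "thought".
        # Because each pass re-reads the perturbed state, a small initial
        # offset can compound into a steered latent trajectory.
        thought = hidden[:, -1:, :]
        embeds = torch.cat([embeds, thought], dim=1)
    return embeds
```

The property the attack exploits is visible in the loop itself: the recurrence gives a one-shot embedding perturbation repeated opportunities to be re-amplified, with no token-level checkpoint at which a defense could inspect the chain of thought.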
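The linear-separability claim in the fourth point is measurable with an ordinary linear probe. The sketch below, using scikit-learn, shows one way to compute a probe AUC over clean versus triggered latent vectors; the function name `probe_auc` and the input arrays are hypothetical stand-ins for whatever latents one extracts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_auc(clean_latents, triggered_latents, seed=0):
    """Fit a linear probe and report held-out AUC.

    `clean_latents` and `triggered_latents` are (n, d) arrays of latent
    vectors from clean and backdoored runs. An AUC near 1.0 (the paper
    reports >= 0.999) means the two sets are almost linearly separable.
    """
    X = np.vstack([clean_latents, triggered_latents])
    y = np.concatenate([np.zeros(len(clean_latents)),
                        np.ones(len(triggered_latents))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, probe.decision_function(X_te))
```

Scoring on a held-out split keeps the AUC honest; a probe scored on its own training data would overstate separability.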