Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

arXiv cs.LG / 4/2/2026


Key Points

  • The paper introduces “ThoughtSteer,” a backdoor attack on continuous latent-reasoning language models that produces hijacked outputs while emitting no token-level trace of the manipulation.
  • By perturbing a single input-layer embedding vector, the attacker leverages the model’s own multi-pass latent reasoning to amplify the change into a controlled latent trajectory that yields the attacker’s chosen answer.
  • Experiments across two model architectures (Coconut, SimCoT), three reasoning benchmarks, and model sizes from 124M to 3B show ≥99% attack success with near-baseline clean accuracy, strong transfer to held-out benchmarks (94–100%), and evasion of five evaluated active defenses.
  • The work attributes the failure of token-level defenses to a latent-space phenomenon (“Neural Collapse”) that pulls triggered representations onto a tight geometric attractor, and it argues that any effective backdoor must therefore leave a linearly separable signature (probe AUC ≥ 0.999).
  • The authors highlight a mechanistic interpretability paradox: correct answer information can still be present in individual latent vectors even while the model outputs the wrong answer, suggesting the adversarial signal lies in the collective trajectory rather than any single embedding.
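The amplification mechanism in the second bullet can be sketched with a toy linear recurrence. This is an illustrative stand-in, not the paper's method: `W` plays the role of the model's action on the latent across reasoning passes, and the trigger direction with gain 1.5 is an invented example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, PASSES = 16, 6

# Toy stand-in for Coconut-style recurrence, where the last hidden state
# is fed back as the next input embedding. W models the (poisoned)
# network's action on the latent; all numbers here are illustrative.
W = 0.9 * np.eye(D)          # generic directions are contracted
trigger_dir = np.zeros(D)
trigger_dir[0] = 1.0
W[0, 0] = 1.5                # the trigger direction is expanded each pass

def latent_trajectory(h0, passes=PASSES):
    traj = [h0]
    for _ in range(passes):
        traj.append(W @ traj[-1])
    return traj

clean = rng.normal(size=D)
eps = 0.01 * trigger_dir     # tiny perturbation of a single embedding

divergence = [np.linalg.norm(a - b)
              for a, b in zip(latent_trajectory(clean + eps),
                              latent_trajectory(clean))]
# divergence grows by a factor of 1.5 per pass along the trigger direction
```

A 0.01-norm perturbation at the input grows geometrically over six passes, which is the qualitative point: the model's own iteration, not the perturbation itself, does the work of steering the trajectory.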

Abstract

A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves ≥99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94–100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC ≥ 0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.
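The "linearly separable signature" claim is about what a linear probe sees: if triggered runs collapse onto a tight attractor, a mean-difference direction already separates them almost perfectly. A minimal sketch on synthetic latents (the data, the 6-sigma shift, and the probe are all assumptions for illustration, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 16, 500

# Synthetic latents: triggered runs shift tightly along one direction,
# a toy stand-in for the Neural Collapse attractor described above.
clean_latents = rng.normal(size=(N, D))
triggered = rng.normal(size=(N, D))
triggered[:, 0] += 6.0

# Mean-difference linear probe, then rank-based (Mann-Whitney) AUC.
w = triggered.mean(axis=0) - clean_latents.mean(axis=0)
scores = np.concatenate([clean_latents @ w, triggered @ w])
labels = np.concatenate([np.zeros(N), np.ones(N)])

order = scores.argsort()
ranks = np.empty(2 * N)
ranks[order] = np.arange(1, 2 * N + 1)
auc = (ranks[labels == 1].sum() - N * (N + 1) / 2) / (N * N)
```

Even this untrained one-line probe pushes AUC toward 1.0 on such data, which is why the paper can frame near-perfect probe separability as a necessary consequence of an effective backdoor rather than a lucky detection result.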