Backdoor Directions in Vision Transformers
arXiv cs.CV / 3/12/2026
Key Points
- The paper identifies a specific 'trigger direction' in Vision Transformer activations that encodes the internal representation of a backdoor when a trigger is present.
- It demonstrates the causal role of this direction by showing that interventions in both activation and parameter space consistently modulate backdoor behavior across multiple datasets and attack types.
- The trigger direction serves as a diagnostic tool for tracing how backdoor features are processed across layers, revealing that static-patch triggers and stealthy distributed triggers are handled by distinct internal logic.
- The study examines the link between backdoors and adversarial attacks, testing whether PGD-based perturbations can activate or deactivate the identified trigger mechanism.
- It proposes a data-free, weight-based detection scheme for stealthy-trigger attacks, illustrating how mechanistic interpretability can diagnose and address security vulnerabilities in computer vision.
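
To make the idea of a trigger direction and an activation-space intervention concrete, here is a minimal sketch on synthetic data. It assumes the direction is estimated as a difference of mean activations between triggered and clean inputs, and that the intervention is a projection or a shift along that direction; the summary does not state how the paper computes either, so the function names (`estimate_direction`, `ablate_direction`, `steer_along_direction`) and the toy activations are illustrative assumptions, not the authors' method.

```python
import numpy as np


def estimate_direction(clean_acts, triggered_acts):
    """Unit vector pointing from the clean mean to the triggered mean.

    Assumption: a difference-of-means estimate; the paper may use another method.
    """
    diff = triggered_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove each activation's component along the unit vector `direction`."""
    return acts - np.outer(acts @ direction, direction)


def steer_along_direction(acts, direction, alpha):
    """Shift every activation by `alpha` along `direction`."""
    return acts + alpha * direction


# Synthetic stand-in for one layer's activations (hypothetical data, no real model).
rng = np.random.default_rng(0)
dim = 16
planted = np.zeros(dim)
planted[3] = 1.0  # the "backdoor" signal lives on coordinate 3 in this toy setup

clean = rng.normal(size=(200, dim))
triggered = rng.normal(size=(200, dim)) + 4.0 * planted

direction = estimate_direction(clean, triggered)
ablated = ablate_direction(triggered, direction)
steered = steer_along_direction(clean, direction, alpha=4.0)
```

In this toy setup, `ablated` has no remaining component along `direction`, while `steered` moves clean activations toward the triggered cluster. In a real Vision Transformer the same operations would be applied to residual-stream activations at a chosen layer, for example through forward hooks, and the effect would be measured on the attack success rate.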