Steered LLM Activations are Non-Surjective
arXiv cs.AI / 4/14/2026
Key Points
- The paper studies activation steering as a surjectivity question: whether every activation state produced by steering can be reached by some discrete text prompt through the model’s standard forward pass.
- Under practical assumptions, it proves that steering drives the residual stream off the manifold of activation states reachable from prompts, meaning most steered internal behaviors have no prompt pre-image.
- The authors report empirical evidence across three widely used LLMs that supports the theoretical non-surjectivity result.
- The findings formally separate “white-box” steerability from “black-box” prompt-based realizability, suggesting steering success should not be taken as evidence of interpretability or vulnerability via prompts.
- The work recommends evaluation protocols that explicitly decouple white-box interventions (steering) from black-box prompting when assessing interpretability and safety risks.
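The non-surjectivity claim can be illustrated with a toy model. The sketch below is a hypothetical construction, not the paper's actual proof or experimental setup: a tiny "forward pass" with a ReLU keeps all prompt-reachable hidden states non-negative, so an additive steering intervention that pushes one coordinate negative provably lands outside the set of states any discrete prompt can produce, i.e. the steered state has no prompt pre-image.

```python
import itertools

import numpy as np

# Toy stand-in for an LLM forward pass (hypothetical; chosen only to
# make "prompt-reachable states" enumerable). 5-token vocab, 4-dim states.
rng = np.random.default_rng(0)
vocab = rng.normal(size=(5, 4))  # token embeddings
W = rng.normal(size=(4, 4))      # one "layer"

def forward(prompt):
    """Final hidden state for a discrete prompt (tuple of token ids)."""
    h = vocab[list(prompt)].mean(axis=0)  # pool token embeddings
    return np.maximum(W @ h, 0.0)         # ReLU: reachable states are >= 0

# Enumerate every hidden state reachable from a 2-token prompt.
reachable = {tuple(np.round(forward(p), 6))
             for p in itertools.product(range(5), repeat=2)}

# "Steer" an activation: push coordinate 0 below every reachable value.
max0 = max(s[0] for s in reachable)
steered = forward((0, 1)).copy()
steered[0] -= max0 + 1.0  # coordinate 0 is now negative

# ReLU outputs are non-negative, so the steered state cannot be the
# image of any prompt: steering left the prompt-reachable manifold.
has_preimage = tuple(np.round(steered, 6)) in reachable
print(has_preimage)  # False
```

This mirrors the paper's separation in miniature: the white-box intervention (editing `steered` directly) realizes an internal state that no black-box prompt can, which is why steering success alone says nothing about prompt-based realizability.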