Predicting Where Steering Vectors Succeed

arXiv cs.LG / 4/20/2026

Key Points

  • The paper introduces the Linear Accessibility Profile (LAP), a per-layer diagnostic that predicts when steering vectors will work for a given concept and layer without requiring any training.
  • LAP builds on the logit lens: its key measure, A_lin, applies the model’s unembedding matrix to intermediate hidden states and serves as the predictor of steering effectiveness.
  • Across 24 controlled binary concept families and five models (Pythia-2.8B to Llama-8B), peak A_lin correlates strongly with steering effectiveness (ρ = +0.86 to +0.91) and with layer selection (ρ = +0.63 to +0.92).
  • The authors propose a three-regime framework explaining when linear “difference-of-means” steering suffices, when nonlinear approaches are required, and when no steering method is likely to work.
  • An end-to-end entity-steering demo confirms the prediction: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while steering at the middle layer (the standard heuristic) has no effect on either model.
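
For readers unfamiliar with the "difference-of-means" steering mentioned above, the following is a minimal sketch of the general technique, not the authors' code. The function names, the scaling factor `alpha`, and the toy activations are all hypothetical; a real setup would collect hidden states from a chosen layer of a model for prompts with and without the concept.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Steering direction: mean of concept-present states minus mean of concept-absent states."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the scaled steering direction to a single hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy hidden states at one layer (made-up numbers, 3 dimensions).
pos_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # concept present
neg_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]  # concept absent

v = difference_of_means(pos_acts, neg_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=2.0)
print(v)        # [2.0, -2.0, 0.0]
print(steered)  # [4.5, -3.5, 0.5]
```

The third dimension is identical in both groups, so it contributes nothing to the direction; only the dimensions that separate the two groups are moved.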

Abstract

Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, A_lin, applies the model's unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak A_lin predicts steering effectiveness at ρ = +0.86 to +0.91 and layer selection at ρ = +0.63 to +0.92. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model.
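
The abstract does not give the formula for A_lin, so the sketch below is only an illustration of the logit-lens idea it describes: project each layer's hidden state through the unembedding matrix and check how well the readout separates a binary concept. The per-layer score used here (the fraction of examples whose readout favors the correct token) is an assumed stand-in, not the paper's measure, and all names and numbers are hypothetical. Practical logit-lens implementations typically also apply the model's final normalization before unembedding, which is omitted here.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # unembedding matrix, shape: vocab_size x hidden_dim


def logit_lens(hidden: Vector, unembed: Matrix) -> Vector:
    """Project one intermediate hidden state onto vocabulary logits."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]


def accessibility_profile(
    pos_by_layer: List[List[Vector]],
    neg_by_layer: List[List[Vector]],
    unembed: Matrix,
    pos_token: int,
    neg_token: int,
) -> List[float]:
    """Assumed proxy score per layer: share of examples read out correctly."""
    profile = []
    for pos_states, neg_states in zip(pos_by_layer, neg_by_layer):
        correct = 0
        for h in pos_states:
            logits = logit_lens(h, unembed)
            correct += logits[pos_token] > logits[neg_token]
        for h in neg_states:
            logits = logit_lens(h, unembed)
            correct += logits[neg_token] > logits[pos_token]
        profile.append(correct / (len(pos_states) + len(neg_states)))
    return profile


# Toy setup: 2 layers, hidden_dim = 2, vocabulary of 2 tokens (made-up numbers).
unembed = [[1.0, 0.0], [0.0, 1.0]]
pos_by_layer = [[[0.1, 0.2], [0.3, 0.1]], [[1.0, 0.0], [2.0, 0.5]]]
neg_by_layer = [[[0.2, 0.1], [0.1, 0.3]], [[0.0, 1.0], [0.5, 2.0]]]

profile = accessibility_profile(pos_by_layer, neg_by_layer, unembed, 0, 1)
best_layer = max(range(len(profile)), key=profile.__getitem__)
print(profile)     # [0.5, 1.0]
print(best_layer)  # 1
```

In this toy example the concept is at chance at layer 0 and fully readable at layer 1, so a profile-guided choice would steer at layer 1 rather than defaulting to a fixed middle layer.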