Predicting Where Steering Vectors Succeed

arXiv cs.LG / 4/20/2026


Key Points

The paper introduces the Linear Accessibility Profile (LAP), a per-layer diagnostic that predicts when steering vectors will work for a given concept and layer without requiring any training.
LAP leverages the logit lens idea by applying a model’s unembedding matrix to intermediate hidden states and uses the score A_lin as the key predictor of steering effectiveness.
Across 24 binary concept families and five models (Pythia-2.8B to Llama-8B), peak A_lin correlates strongly with steering effectiveness (ρ = +0.86 to +0.91) and with the best layer to steer at (ρ = +0.63 to +0.92).
The authors propose a three-regime framework explaining when linear “difference-of-means” steering suffices, when nonlinear approaches are required, and when no steering method is likely to work.
An end-to-end entity-steering demo validates the approach: steering at the LAP-recommended layer changes outputs for Gemma-2-2B and OLMo-2-1B-Instruct, while the usual middle-layer heuristic has no effect.
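
The logit-lens idea behind the key points above can be illustrated with a small NumPy sketch. This summary does not give the paper's formula for A_lin, so the score below is only a hypothetical stand-in: it projects each layer's hidden states through the unembedding matrix and measures how often the logit margin between concept-positive and concept-negative tokens matches the label. The function name, the token-id arguments, and the accuracy-style score are all assumptions made for illustration. Real logit-lens implementations usually also apply the model's final normalization layer before unembedding, which is omitted here.

```python
import numpy as np

def logit_lens_profile(hidden_by_layer, unembed, pos_token_ids, neg_token_ids, labels):
    """Hypothetical per-layer accessibility score (NOT the paper's exact A_lin).

    hidden_by_layer: array of shape (num_layers, num_examples, d_model)
    unembed:         array of shape (d_model, vocab_size)
    pos_token_ids:   token ids associated with the positive side of the concept
    neg_token_ids:   token ids associated with the negative side of the concept
    labels:          array of shape (num_examples,), 1 = positive, 0 = negative
    """
    labels = np.asarray(labels)
    scores = []
    for layer_hidden in hidden_by_layer:
        # Logit lens: read intermediate states directly through the unembedding.
        logits = layer_hidden @ unembed
        # Margin between mean positive-token and mean negative-token logits.
        margin = logits[:, pos_token_ids].mean(axis=1) - logits[:, neg_token_ids].mean(axis=1)
        # Fraction of examples whose margin sign agrees with the label.
        predicted_positive = margin > 0
        scores.append(np.mean(predicted_positive == (labels == 1)))
    return np.array(scores)
```

No probe is fitted anywhere in this sketch, which mirrors the "no training" property the summary attributes to LAP. A layer where the concept is not yet readable through the unembedding scores near chance.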

Abstract

Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, $A_{\mathrm{lin}}$, applies the model's unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak $A_{\mathrm{lin}}$ predicts steering effectiveness at $\rho = +0.86$ to $+0.91$ and layer selection at $\rho = +0.63$ to $+0.92$. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model.
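
For readers unfamiliar with difference-of-means steering, a minimal NumPy sketch follows. It shows the generic technique only: the steering vector is the mean hidden state over concept-positive examples minus the mean over concept-negative examples, and it is added to a hidden state with a scale factor. The helper names, the `alpha` parameter, and the rule of picking the layer with the highest per-layer score are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def difference_of_means(pos_hidden, neg_hidden):
    """Steering vector: mean positive activation minus mean negative activation.

    pos_hidden, neg_hidden: arrays of shape (num_examples, d_model),
    collected at the same layer.
    """
    return np.asarray(pos_hidden).mean(axis=0) - np.asarray(neg_hidden).mean(axis=0)

def apply_steering(hidden, vector, alpha=1.0):
    """Shift hidden states along the steering vector, scaled by alpha (hypothetical knob)."""
    return np.asarray(hidden) + alpha * np.asarray(vector)

def pick_layer(profile):
    """Assumed selection rule: steer at the layer with the highest profile score."""
    return int(np.argmax(profile))
```

In a real model the shifted hidden state would be written back into the forward pass at the chosen layer, for example through a framework hook. That wiring depends on the model library and is left out here.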


