Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment

arXiv cs.LG / 4/7/2026

Key Points

The paper argues that steering vectors for LLM alignment should not assume a single fixed intervention layer, because the layers encoding representations relevant to a target behavior can vary by input.
It provides theoretical and empirical evidence that the optimal steering layer differs substantially across inputs and can affect alignment effectiveness.
The authors introduce “Where to Steer (W2S),” a framework that learns an input-conditioned mapping from input embeddings to the best steering layer.
Experiments across multiple LLMs and different alignment behaviors show W2S improves over fixed-layer steering baselines in both in-distribution and out-of-distribution settings.
The work reframes adaptive, input-dependent layer selection as a missing design dimension in current steering-vector alignment methods.
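
The contrast drawn in these points can be written compactly. The notation below is assumed for illustration and is not taken from the paper: h_ℓ(x) is the hidden representation of input x at layer ℓ, v_ℓ is a steering vector for that layer, α is a steering strength, e(x) is an input embedding, and g is the learned layer selector. Additive steering is also an assumption here, since the summary only says that representations are shifted towards the target behavior.

```latex
\begin{aligned}
\text{Fixed layer:} \quad
  & h_{\ell^*}(x) \leftarrow h_{\ell^*}(x) + \alpha\, v_{\ell^*}
  && \text{with the same } \ell^* \text{ for every input } x, \\
\text{Input-dependent:} \quad
  & h_{\hat{\ell}(x)}(x) \leftarrow h_{\hat{\ell}(x)}(x) + \alpha\, v_{\hat{\ell}(x)}
  && \text{with } \hat{\ell}(x) = g\big(e(x)\big).
\end{aligned}
```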

Abstract

Steering vectors have emerged as a lightweight and effective approach for aligning large language models (LLMs) at inference time, enabling modulation over model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework that adaptively selects the intervention layer conditioned on the input, by learning a mapping from input embeddings to optimal steering layers. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings. Our findings highlight the importance of input-dependent control in LLM alignment and demonstrate that adaptive layer selection is a key design dimension missing in the current methodology of steering vectors.
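
To make the idea of mapping input embeddings to steering layers concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the nearest-centroid selector stands in for whatever model W2S actually learns, the additive `steer` update is an assumed form of steering, and all names and toy values (`fit_layer_selector`, `select_layer`, `train_best_layers`, and so on) are hypothetical. How the per-input "best layer" labels are obtained is not described in this summary, so they are simply given here.

```python
from collections import defaultdict
from math import dist


def fit_layer_selector(embeddings, best_layers):
    """Learn one centroid per steering layer from (embedding, best layer) pairs."""
    groups = defaultdict(list)
    for emb, layer in zip(embeddings, best_layers):
        groups[layer].append(emb)
    # Average the embeddings assigned to each layer, coordinate by coordinate.
    return {
        layer: [sum(col) / len(col) for col in zip(*embs)]
        for layer, embs in groups.items()
    }


def select_layer(embedding, centroids):
    """Pick the layer whose centroid is closest to the input embedding."""
    return min(centroids, key=lambda layer: dist(embedding, centroids[layer]))


def steer(hidden, steering_vector, alpha=1.0):
    """Shift one hidden state along the steering vector (assumed additive form)."""
    return [h + alpha * v for h, v in zip(hidden, steering_vector)]


# Hypothetical toy data: 2-D "embeddings" labelled with the layer that worked best.
train_embeddings = [[2.0, 0.1], [1.8, -0.2], [-2.1, 0.0], [-1.9, 0.3]]
train_best_layers = [1, 1, 3, 3]
centroids = fit_layer_selector(train_embeddings, train_best_layers)

# Hypothetical per-layer hidden states and steering vectors for one new input.
hidden_states = {1: [0.0, 0.0], 3: [1.0, 1.0]}
steering_vectors = {1: [0.5, 0.5], 3: [-1.0, 0.0]}

new_embedding = [1.5, 0.0]
layer = select_layer(new_embedding, centroids)  # layer chosen per input
steered = steer(hidden_states[layer], steering_vectors[layer], alpha=2.0)
print(layer, steered)
```

In a real model the update would be applied inside the forward pass at the selected layer, so that later layers see the shifted representation; the dictionary lookup above only illustrates the selection step. A fixed-layer baseline corresponds to replacing `select_layer` with a constant.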
