Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
arXiv cs.LG / 4/22/2026
Key Points
- The paper proposes improving activation steering for LLMs by treating inference as a control problem with online (feedback) error correction, rather than the fixed, open-loop interventions used by standard activation steering.
- It finds that, although transformer blocks are nonlinear, their layer-wise dynamics across multiple architectures and model scales are well-approximated by locally linear models, enabling a linear time-varying formulation.
- Using layer-wise Jacobians, the authors adapt the Linear Quadratic Regulator (LQR) to compute feedback controllers that steer activations toward target semantic setpoints with low computational overhead and no offline training.
- The method includes theoretical tracking-error bounds and introduces an adaptive semantic feature setpoint signal, leading to robust, fine-grained control across tasks.
- Experiments report state-of-the-art modulation of behaviors such as toxicity, truthfulness, refusals, and steering toward arbitrary concepts, and the authors release accompanying code on GitHub.
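The core recipe above — linearize the layer-wise dynamics with Jacobians, then run a finite-horizon linear time-varying LQR to track a semantic setpoint — can be sketched numerically. This is a minimal illustration, not the authors' released implementation: it assumes the error dynamics around the setpoint are approximately linear (`A_l` standing in for layer Jacobians, identity `B_l` for additive activation edits), and all matrices here are synthetic.

```python
import numpy as np


def ltv_lqr_gains(A_list, B_list, Q, R, Q_final):
    """Backward Riccati recursion for a finite-horizon, linear
    time-varying system e_{l+1} = A_l e_l + B_l u_l.
    Returns per-layer feedback gains K_l, with u_l = -K_l e_l."""
    P = Q_final
    gains = []
    for A, B in zip(reversed(A_list), reversed(B_list)):
        S = R + B.T @ P @ B
        K = np.linalg.solve(S, B.T @ P @ A)  # (R + B'PB)^{-1} B'PA
        P = Q + A.T @ P @ (A - B @ K)        # Riccati update
        gains.append(K)
    return gains[::-1]


def steer(x0, target, A_list, B_list, gains):
    """Roll activations forward layer by layer, applying feedback
    on the deviation from the target setpoint at each layer.
    Assumes the setpoint is (approximately) a fixed point, so the
    tracking error follows the linearized dynamics."""
    x = x0.copy()
    traj = [x.copy()]
    for A, B, K in zip(A_list, B_list, gains):
        e = x - target
        u = -K @ e                       # online (feedback) correction
        x = target + A @ e + B @ u       # propagate error, add control
        traj.append(x.copy())
    return traj


# Synthetic stand-ins: 8-dim activations, 6 "layers".
rng = np.random.default_rng(0)
d, L = 8, 6
A_list = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(L)]
B_list = [np.eye(d)] * L
Q, R, Qf = np.eye(d), 0.1 * np.eye(d), 10.0 * np.eye(d)

gains = ltv_lqr_gains(A_list, B_list, Q, R, Qf)
target = np.ones(d)                      # hypothetical semantic setpoint
traj = steer(rng.standard_normal(d), target, A_list, B_list, gains)
```

Because the gains are computed from the (local) Jacobians alone, this incurs no offline training, matching the low-overhead claim in the summary; the tracking error at the final layer should shrink relative to the initial deviation.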