The Cylindrical Representation Hypothesis for Language Model Steering

arXiv cs.CL / 5/5/2026


Key Points

  • The paper argues that existing theory for language-model steering, based on the Linear Representation Hypothesis (LRH), is insufficient because real concept representations are not perfectly orthogonal, leading to unstable and unpredictable steering effects.
  • It introduces the Cylindrical Representation Hypothesis (CRH), proposing that concept absence/presence is captured by a central axis, while steering sensitivity is governed by a surrounding normal plane.
  • CRH suggests that only particular “sensitive sectors” within the normal plane strongly facilitate activation of the target concept, while other sectors can suppress or delay it.
  • The authors formalize how uncertainty at the sector level arises intrinsically (even when steering directions are well aligned), offering an explanation for why steering outcomes can fluctuate.
  • Experiments reportedly confirm the cylindrical structure and show that CRH can be used to interpret and analyze steering behavior in practical settings, with code provided at the linked GitHub repo.
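The geometric picture in the key points above can be sketched numerically. This is a hypothetical illustration, not the authors' implementation: given per-sample difference vectors (concept-present minus concept-absent activations), the central axis is taken here as their mean direction, and each sample's residual then lies in the normal plane, where its angle would determine its sector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
# Toy difference vectors: random noise plus a shared component along
# the first basis direction (standing in for the concept axis).
diffs = rng.normal(size=(100, d_model)) + 3.0 * np.eye(d_model)[0]

# Central axis: unit-normalized mean of the difference vectors.
axis = diffs.mean(axis=0)
axis /= np.linalg.norm(axis)

def decompose(v, axis):
    """Split v into its coordinate along the axis and the residual
    that lies in the surrounding normal plane."""
    along = v @ axis               # scalar coordinate on the central axis
    residual = v - along * axis    # orthogonal to the axis by construction
    return along, residual

along, residual = decompose(diffs[0], axis)
# The residual is orthogonal to the axis up to floating-point error;
# its direction within the normal plane is what CRH calls the sector.
assert abs(residual @ axis) < 1e-6
```

Under CRH, the axis component here would drive concept generation, while the angular position of `residual` would determine whether the sample sits in a sensitive sector.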

Abstract

Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.
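For context, the steering operation the abstract analyzes is typically implemented as an additive shift of a hidden state along a chosen direction. A minimal generic sketch (not CRH-specific code; the function name and toy vectors are illustrative):

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift a hidden-state vector by alpha along the unit steering direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

h = np.zeros(8)                                   # toy hidden state
v = np.array([3.0, 4.0, 0, 0, 0, 0, 0, 0])        # toy steering direction
h_steered = steer(h, v, alpha=5.0)

# The steered state's projection onto the direction grows by exactly alpha.
assert np.isclose(h_steered @ (v / np.linalg.norm(v)), 5.0)
```

CRH's claim is that the outcome of such a shift depends not only on alignment with the central axis but on where the sample's normal-plane component falls, which is why a fixed `alpha` and direction can yield fluctuating effects.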