Distributed Interpretability and Control for Large Language Models

arXiv cs.LG / 4/9/2026


Key Points

  • The paper proposes a practical multi-GPU approach to activation-level interpretability (logit lens) and output steering (steering vectors) for large language models beyond what prior single-GPU tooling supported.
  • The proposed system reduces activation memory by up to 7x and increases throughput by up to 41x versus a baseline on the same hardware.
  • Experiments across LLaMA-3.1 (8B/70B) and Qwen-3 (4B/14B/32B) show sustained generation performance of roughly 20–100 tokens/s while collecting full layer-wise activation trajectories for 1,500-token sequences.
  • By injecting label-position steering vectors post-LayerNorm, the method enables controllable, monotonic output shifts with a reported mean steerability slope of 0.702, without fine-tuning or extra forward passes.
  • The authors release benchmarks, ablations, and a reproducible instrumentation recipe (including a GitHub repo) to support real-time behavioral control and interpretability for frontier LLMs.
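The label-position steering described above can be illustrated with a toy numpy sketch: a scaled steering direction is added to the residual stream at one position, and sweeping the scale shifts a probe score monotonically, which is the behavior summarized by the paper's mean steerability slope. All names, shapes, and the probe here are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def apply_steering(resid, v, alpha, pos):
    """Add a scaled steering vector to the residual stream at a single
    token position (sketch of post-LayerNorm injection at the label position)."""
    out = resid.copy()
    out[pos] = out[pos] + alpha * v
    return out

# Toy residual stream: 5 tokens, hidden size 8 (hypothetical sizes).
rng = np.random.default_rng(1)
resid = rng.normal(size=(5, 8))
v = rng.normal(size=8)
v = v / np.linalg.norm(v)  # unit-norm steering direction

# Sweeping alpha shifts the projection of the steered position onto v
# monotonically -- no fine-tuning or extra forward passes needed.
scores = [float(apply_steering(resid, v, a, pos=-1)[-1] @ v)
          for a in (0.0, 1.0, 2.0)]
print(scores)  # strictly increasing in alpha
```

Because `v` is unit-norm, each unit of `alpha` raises the probe score by exactly one, so the output shift is linear in the injection strength.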

Abstract

The most capable large language models typically require multiple GPUs to host, and understanding and steering them is essential; yet current tooling supports interpretability and steering in the multi-GPU setting far less well than in the single-GPU setting. We present a practical implementation of activation-level interpretability (logit lens) and steering (steering vectors) that scales to multi-GPU language models. Our system makes design choices that reduce activation memory by up to 7x and increase throughput by up to 41x compared to a baseline on identical hardware. We demonstrate the method across LLaMA-3.1 (8B, 70B) and Qwen-3 (4B, 14B, 32B), sustaining 20-100 tokens/s while collecting full layer-wise activation trajectories for 1,500-token sequences. Using label-position steering vectors injected post-LayerNorm, we show controllable, monotonic shifts in model outputs with a mean steerability slope of 0.702 across evaluated datasets, without fine-tuning or additional forward passes. We release detailed benchmarks, ablations, and a reproducible instrumentation recipe to enable practical interpretability and real-time behavioral control for frontier LLMs at https://github.com/Devdesai1901/LogitLense.
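The logit lens mentioned in the abstract projects each intermediate layer's hidden state through the model's final normalization and unembedding matrix, yielding per-layer next-token logits. A minimal numpy sketch, assuming a LLaMA-style RMSNorm and hypothetical toy dimensions (the function and variable names are illustrative, not from the released code):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm as used by LLaMA-style models.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps) * weight

def logit_lens(hidden_states, norm_weight, unembed):
    """Project each layer's hidden state through the final norm and the
    unembedding matrix to obtain per-layer next-token logits."""
    return np.stack([rms_norm(h, norm_weight) @ unembed.T for h in hidden_states])

# Toy setup: 4 layers, hidden size 8, vocab size 16 (hypothetical).
rng = np.random.default_rng(0)
layers = [rng.normal(size=8) for _ in range(4)]  # one hidden state per layer
W_U = rng.normal(size=(16, 8))                   # unembedding matrix
g = np.ones(8)                                   # norm scale

logits = logit_lens(layers, g, W_U)
print(logits.shape)       # (4, 16): one logit vector per layer
print(logits.argmax(-1))  # the "lens" token id at each layer
```

In the multi-GPU setting the paper targets, the per-layer hidden states live on different devices, so the engineering challenge is collecting these trajectories without blowing up activation memory or stalling generation.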