AI Navigate

Activation Exposure & Feature Interpretability for GGUF via llama-server

Reddit r/LocalLLaMA / 3/20/2026


Key Points

  • A C++ patch to llama-server adds /activations endpoints to capture per-layer activation vectors during inference and stream them for offline training.
  • A Python SAE pipeline implements the full sparse autoencoder workflow to derive interpretable features and export them as GGUF control vectors for real-time steering.
  • The patch hooks into llama.cpp's `cb_eval` callback to capture layer outputs (`l_out`) and transfers data GPU→CPU, storing activations in a simple binary format readable by NumPy.
  • The approach uses inter-cluster differential scoring to identify features that fire significantly for a target behavior, yielding behavior-specific control signals rather than generic features.
  • PRs and companion repo provide a quickstart, example clusters, and guidance for live monitoring and steering (e.g., suppressing sycophancy, amplifying creativity).

You can now capture per-layer activation vectors from llama-server during inference, train sparse autoencoders on them, discover which internal features correspond to specific behaviors (sycophancy, hedging, creativity, etc.), and extract those features as GGUF control vectors for real-time steering.

What this is:

A C++ patch to llama-server that adds `/activations` endpoints, plus a Python pipeline for the full SAE workflow. The patch is ~400 lines across 5 files and adds:

  • `GET /activations`: query per-layer mean activations (with top-K filtering)
  • `POST /activations`: enable/disable capture
  • `POST /activations/collect`: stream full per-token vectors to a binary file for offline training
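A hedged sketch of driving these endpoints from Python, using only the standard library. The base URL and the JSON body schema (`enabled`, `layers`, `top_k`) are assumptions here; check the PR for the actual field names:

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed default llama-server address

def enable_capture(layers=None):
    """Build a POST /activations request that toggles capture on.

    The JSON schema ({"enabled": ..., "layers": ...}) is a guess --
    consult the PR for the real field names.
    """
    body = json.dumps({"enabled": True, "layers": layers or "all"}).encode()
    return urllib.request.Request(
        f"{BASE}/activations", data=body,
        headers={"Content-Type": "application/json"}, method="POST")

def query_activations(top_k=16):
    """Build a GET /activations request with top-K filtering."""
    return urllib.request.Request(
        f"{BASE}/activations?top_k={top_k}", method="GET")

# To actually send (requires a patched llama-server running):
# with urllib.request.urlopen(enable_capture()) as resp:
#     print(resp.read())
```

The requests are built but not sent here, so the sketch stands alone without a running server.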

What you can do with it:

  1. Monitor activations live: see which features fire strongest during a conversation
  2. Collect training data: stream per-token activation vectors to disk while running inference
  3. Train a sparse autoencoder: decompose activations into ~16K interpretable features (takes about 40 seconds on an RTX 3090)
  4. Discover behavioral features: define phrase clusters ("sycophantic phrases", "hedging phrases", etc.) and find which features are unique to each behavior
  5. Extract control vectors: turn discovered features into GGUF files you can load with `--control-vector-scaled`
  6. Steer in real time: suppress sycophancy, amplify creativity, whatever you want, at the feature level
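The sparse autoencoder in step 3 is conceptually small: encode activations into many more features than dimensions, keep them non-negative and sparse, and reconstruct. A toy forward pass in plain Python (random weights and tiny dimensions standing in for the trained ~16K-feature SAE on real activations):

```python
import random

random.seed(0)

D, F = 8, 32   # toy sizes: 8-dim activations, 32 SAE features (real: 4096 -> ~16K)

# Random toy weights standing in for trained SAE parameters.
W_enc = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(F)]
b_enc = [0.0] * F
W_dec = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(D)]

def relu(v):
    return v if v > 0 else 0.0

def encode(x):
    # f = ReLU(W_enc @ x + b_enc): non-negative, sparse feature activations
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(f):
    # x_hat = W_dec @ f: reconstruct the activation from features
    return [sum(w * fi for w, fi in zip(row, f)) for row in W_dec]

x = [random.gauss(0, 1) for _ in range(D)]
f = encode(x)
x_hat = decode(f)

# Training minimizes reconstruction error plus an L1 sparsity penalty:
mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / D
l1 = sum(f)  # loss = mse + lam * l1
```

Each decoder row `W_dec[:, j]` is a direction in activation space; the control-vector step in 5 amounts to exporting those directions for the features you care about.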

How it works technically:

The patch hooks into llama.cpp's existing `cb_eval` callback to intercept `l_out` tensors (layer outputs) during the forward pass. Data is copied GPU→CPU via `ggml_backend_tensor_get()` and stored in a mutex-protected global struct. The binary collection format is dead simple: 16-byte header + float32 arrays, directly readable with NumPy.
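A minimal round-trip sketch for a file in that shape, stdlib only. The post pins down "16-byte header + float32 arrays" but not the header's fields, so the four u32 fields below (magic, version, hidden dim, token count) are a hypothetical layout; check the PR for the real one:

```python
import struct
from array import array

# Hypothetical 16-byte header: magic, version, hidden_dim, n_tokens (all u32).
HDR = struct.Struct("<4I")

def write_dump(path, vectors, magic=0x41435456, version=1):
    """Write per-token float32 activation vectors with a 16-byte header."""
    dim = len(vectors[0])
    with open(path, "wb") as fh:
        fh.write(HDR.pack(magic, version, dim, len(vectors)))
        for v in vectors:
            array("f", v).tofile(fh)  # float32, little-endian on common platforms

def read_dump(path):
    """Read the header, then dim * n_tokens float32 values."""
    with open(path, "rb") as fh:
        magic, version, dim, n = HDR.unpack(fh.read(HDR.size))
        data = array("f")
        data.fromfile(fh, dim * n)
    # NumPy equivalent after the header: np.fromfile(fh, np.float32).reshape(n, dim)
    return [list(data[i * dim:(i + 1) * dim]) for i in range(n)]
```

Anything that can seek past 16 bytes and reinterpret the rest as float32 can consume the dump, which is the point of keeping the format this plain.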

The SAE pipeline is standard: collect activations → train sparse autoencoder → probe features with behavioral phrase clusters → extract feature directions as control vectors. The interesting part is the inter-cluster differential scoring: instead of just finding "features that fire on sycophantic text," it finds features that fire *significantly more* on sycophantic text than on any other cluster, so you get specific behavioral features rather than generic language features.
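The differential-scoring idea can be sketched in a few lines (this is the concept, not the repo's actual code; cluster names and numbers below are made up):

```python
def differential_scores(cluster_feats, target):
    """Score each SAE feature by how much more it fires on the target
    cluster than on the strongest competing cluster.

    cluster_feats maps cluster name -> list of per-phrase feature
    activation vectors (one float per feature).
    """
    def mean_acts(rows):
        n = len(rows)
        return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

    means = {name: mean_acts(rows) for name, rows in cluster_feats.items()}
    tgt = means[target]
    others = [m for name, m in means.items() if name != target]
    # score_j = target mean minus the max over all other clusters:
    # high only for features specific to the target behavior.
    return [t - max(o[j] for o in others) for j, t in enumerate(tgt)]

# Toy example: 3 features, 3 behavioral clusters of 2 phrases each.
clusters = {
    "sycophantic": [[0.9, 0.1, 0.8], [0.7, 0.2, 0.6]],
    "hedging":     [[0.1, 0.9, 0.7], [0.3, 0.7, 0.9]],
    "neutral":     [[0.0, 0.1, 0.8], [0.2, 0.0, 0.6]],
}
scores = differential_scores(clusters, "sycophantic")
```

In this toy data, feature 0 scores high (fires on sycophancy and nowhere else), while feature 2 scores near zero despite firing strongly on sycophantic text, because it fires on everything: exactly the generic-language feature the differential score filters out.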

PR + repo:

The companion repo has a quickstart script, example behavioral cluster definitions, and a comprehensive guide covering the full workflow.

Notes:

  • MoE models are *extremely* sensitive to control vector scales. Dense models (Qwen3-8B, 4096 embd) handle scales of 0.15-0.6 fine. Qwen3.5-35B-A3B MoE (2048 embd) needs 0.01-0.05 or output goes garbled.
  • The eval callback registration had a bug where it only got set inside the graph-reuse branch, so capture silently stopped working after the first inference. Took a while to track that one down.
  • You need ~500K tokens of activation data for a good SAE. Harry's DPO conversations are ~14K tokens each, so 20 rows gets you there.
  • Persona DPO overfits past step ~200 with small datasets; step 200 was the sweet spot (~97% eval accuracy).
  • SAEs are not the be-all and end-all here; they're just one of several pathways to feature interpretability. But they're a simple approach, and the process should be fairly adaptable.

Enjoy!

submitted by /u/wattswrites