reward-lens: A Mechanistic Interpretability Library for Reward Models

arXiv cs.AI / 4/30/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The paper introduces “reward-lens,” an open-source mechanistic interpretability library that adapts common interpretability techniques (logit lens, attribution, activation patching, sparse autoencoders) to reward models used in RLHF.
  • It argues that interpretability for reward models should be centered on the reward head's weight vector, w_r, as the natural axis for analyzing what the model "scores" and why (a minimal sketch of this idea follows the list).
  • The library's analysis tools include a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, and cross-model comparison, plus five theory-grounded extensions.
  • The authors validate the framework on two production reward models using ~695 RewardBench pairs and find that linear attribution poorly predicts causal patching effects, with mean Spearman correlations that are negative or near zero.
  • Rather than treating the mismatch as a flaw, the work frames disagreement between observational and causal measures as an informative property and designs the toolkit to compare them directly.

Abstract

Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit -- logit lens, direct logit attribution, activation patching, sparse autoencoders -- was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head's weight vector w_r is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman ρ = -0.256 on Skywork, -0.027 on ArmoRM). The framework treats this disagreement as a property to expose, not a bug -- motivating a design that keeps observational and causal views first-class and directly comparable.