SteerRM: Debiasing Reward Models via Sparse Autoencoders
arXiv cs.CL / 3/16/2026
Key Points
- SteerRM is a training-free method for debiasing reward models (RMs): it applies sparse autoencoder (SAE) interventions at inference time to suppress bias features.
- It identifies bias-related SAE features using a strength-stability criterion on contrastive paired responses, enabling targeted suppression of superficial stylistic cues.
- The approach improves Hard-split accuracy by an average of 7.3 points across six reward models on RM-Bench while preserving overall performance, and generalizes to a Gemma-based RM and other bias types.
- Format-related bias features are concentrated in shallow layers and transfer across models, which points to bias encoding patterns shared at the architecture level.
- Because it requires no retraining, SteerRM offers a practical, interpretable debiasing step for alignment pipelines and lowers the cost of deploying it in RM systems.
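
The selection and suppression steps described in the points above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the summary does not give the exact strength-stability criterion or the form of the intervention, so the definitions here are assumptions. "Strength" is taken to be the mean activation gap between the biased and unbiased response in each contrastive pair, "stability" is the fraction of pairs where that gap is positive, and suppression subtracts the selected features' decoder contribution from the hidden state. The function names, thresholds, and the ReLU SAE form are all hypothetical.

```python
import numpy as np


def sae_encode(h, W_enc, b_enc):
    """ReLU SAE encoder (assumed form): hidden state -> feature activations."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def select_bias_features(acts_biased, acts_clean,
                         min_strength=0.5, min_stability=0.9):
    """Return indices of features that pass both thresholds.

    acts_biased, acts_clean: arrays of shape (n_pairs, n_features) holding
    SAE activations for the biased and unbiased response of each pair.
    The thresholds are hypothetical defaults, not values from the paper.
    """
    diff = acts_biased - acts_clean
    strength = diff.mean(axis=0)         # average activation gap per feature
    stability = (diff > 0).mean(axis=0)  # share of pairs with a positive gap
    keep = (strength >= min_strength) & (stability >= min_stability)
    return np.flatnonzero(keep)


def suppress_features(h, W_enc, b_enc, W_dec, feature_ids):
    """Remove the selected features' decoder contribution from h.

    Subtracting the contribution (rather than replacing h with the SAE
    reconstruction) leaves the part of h that the SAE does not explain
    untouched. This is one common steering choice, assumed here.
    """
    acts = sae_encode(h, W_enc, b_enc)
    contribution = acts[..., feature_ids] @ W_dec[feature_ids]
    return h - contribution
```

In a real pipeline, `h` would be the residual-stream activation at a chosen layer of the reward model, and `suppress_features` would run inside a forward hook at inference time, so no model weights change.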



