SteerRM: Debiasing Reward Models via Sparse Autoencoders
arXiv cs.CL / 3/16/2026
Key Points
- SteerRM introduces a training-free method for debiasing reward models by applying sparse autoencoder (SAE) interventions at inference time to suppress bias features.
- It identifies bias-related SAE features using a strength-stability criterion on contrastive paired responses, enabling targeted suppression of superficial stylistic cues.
- The approach improves Hard-split accuracy by an average of 7.3 points across six reward models on RM-Bench while preserving overall performance, and generalizes to a Gemma-based RM and other bias types.
- Findings show that format-related bias features are concentrated in shallow layers and transfer across models, indicating shared architecture-level bias encoding patterns.
- SteerRM offers a practical, interpretable way to debias reward models in alignment pipelines without retraining, which lowers the cost of deploying debiasing in RM systems.
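
The selection-then-suppression idea in the key points can be illustrated with a small sketch. This is not the paper's implementation: the summary does not define the strength-stability criterion, so the code assumes one plausible reading (strength as the mean activation gap between the biased and neutral response in each pair, stability as the fraction of pairs where that gap is positive). The function names, thresholds, and toy activations are all hypothetical, and suppression is modeled as subtracting each selected feature's decoder contribution from the hidden state, a common way to ablate SAE features.

```python
from statistics import mean


def select_bias_features(biased_acts, neutral_acts,
                         min_strength=0.5, min_stability=0.9):
    """Return SAE feature indices with a large, consistent activation gap.

    biased_acts[i][j] / neutral_acts[i][j]: activation of feature j on the
    biased / neutral response of contrastive pair i (assumed inputs).
    Thresholds are illustrative, not values from the paper.
    """
    n_features = len(biased_acts[0])
    selected = []
    for j in range(n_features):
        diffs = [b[j] - n[j] for b, n in zip(biased_acts, neutral_acts)]
        strength = mean(diffs)                              # average gap
        stability = sum(d > 0 for d in diffs) / len(diffs)  # sign consistency
        if strength >= min_strength and stability >= min_stability:
            selected.append(j)
    return selected


def steer_hidden(hidden, feature_acts, decoder, bias_features):
    """Remove the selected features' decoder contributions from a hidden state.

    decoder[j] is the (assumed) decoder direction of feature j, with the
    same dimensionality as `hidden`.
    """
    steered = list(hidden)
    for j in bias_features:
        for k, direction_k in enumerate(decoder[j]):
            steered[k] -= feature_acts[j] * direction_k
    return steered


# Toy data: 3 contrastive pairs, 3 SAE features.
biased = [[2.0, 0.1, 1.0], [1.8, 0.0, 0.2], [2.2, 0.3, 0.9]]
neutral = [[0.5, 0.1, 0.1], [0.4, 0.2, 0.9], [0.6, 0.2, 0.1]]
bias_features = select_bias_features(biased, neutral)

# Toy hidden state (2 dims) and decoder directions for the 3 features.
hidden = [1.0, 1.0]
feature_acts = [2.0, 0.5, 0.0]
decoder = [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0]]
steered = steer_hidden(hidden, feature_acts, decoder, bias_features)
```

In this toy setup only feature 0 passes both thresholds: feature 2 has a large gap on two pairs but flips sign on the third, so it fails the stability check. Because the intervention only edits activations at inference time, no reward model weights are changed.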