Debiasing Reward Models via Causally Motivated Inference-Time Intervention

arXiv cs.CL / 5/1/2026


Key Points

  • Reward models (RMs) used for aligning LLMs can latch onto spurious cues like response length, and common inference-time fixes that target only length can cause trade-offs in overall performance.
  • The paper proposes a causally motivated inference-time intervention that identifies neurons whose activations correlate with multiple predefined bias attributes and suppresses those activations at the neuron level.
  • Experiments on RM benchmarks show reduced sensitivity to spurious features across different bias types without degrading performance.
  • For preference annotation, small 2B/7B RMs modified by intervening on fewer than 2% of neurons enable LLMs to reach alignment quality comparable to a state-of-the-art 70B RM on AlpacaEval and MT-Bench.
  • Additional analysis suggests bias signals are mainly encoded in neurons in early layers, offering insight into how RMs exploit biases internally.
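The steps in the bullets above can be sketched in a minimal form. This is an illustrative assumption, not the paper's implementation: it ranks neurons by the absolute Pearson correlation of their activations with a bias attribute (here, response length, one of the spurious cues mentioned) and zeroes the top-ranked few percent at inference time. All data is simulated; the function names and thresholds are hypothetical.

```python
import numpy as np

# Hedged sketch (NOT the paper's exact procedure): find neurons whose
# activations correlate with a predefined bias attribute, then apply a
# neuron-level intervention that suppresses those activations.

rng = np.random.default_rng(0)
n_samples, n_neurons = 200, 64

# Simulated hidden activations for a batch of responses.
acts = rng.normal(size=(n_samples, n_neurons))
# Simulated bias attribute (response length), leaked into neurons 0-2.
length = rng.integers(20, 400, size=n_samples).astype(float)
acts[:, :3] += 0.01 * length[:, None]

def biased_neurons(acts, attribute, top_frac=0.02):
    """Rank neurons by |Pearson correlation| with the bias attribute
    and return the indices of the top `top_frac` fraction."""
    a = (acts - acts.mean(axis=0)) / acts.std(axis=0)
    b = (attribute - attribute.mean()) / attribute.std()
    corr = (a * b[:, None]).mean(axis=0)
    k = max(1, int(top_frac * acts.shape[1]))
    return np.argsort(-np.abs(corr))[:k]

def suppress(acts, neuron_ids):
    """Neuron-level intervention: zero the selected activations."""
    out = acts.copy()
    out[:, neuron_ids] = 0.0
    return out

idx = biased_neurons(acts, length, top_frac=0.05)
debiased = suppress(acts, idx)
```

In a real reward model this selection would run once on held-out annotated data, and the suppression would be applied via a forward hook on the chosen layer at scoring time; the paper reports intervening on fewer than 2% of neurons.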

Abstract

Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inference-time approaches for mitigating these biases typically focus exclusively on response length, resulting in performance trade-offs. In this paper, we propose a causally motivated intervention for mitigating multiple types of biases in RMs at inference time. Our method first identifies neurons whose activations are strongly correlated with predefined bias attributes, then applies a neuron-level intervention that suppresses these signals. We evaluate our method on RM benchmarks and observe reductions in sensitivity to spurious features across diverse bias types, without inducing performance trade-offs. Moreover, when used for preference annotation, small RMs (2B and 7B) with our method, which edits fewer than 2% of the neurons in the RMs, enable LLMs to improve alignment, achieving performance comparable to that of a state-of-the-art 70B RM on AlpacaEval and MT-Bench. Further analysis reveals that bias signals are primarily encoded by neurons in early layers, shedding light on the internal mechanisms of bias exploitation in RMs.