Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation

arXiv cs.CL / 5/5/2026


Key Points

  • The paper introduces “decoding-time debiasing” that mitigates social bias in large language models without retraining or fine-tuning, by searching over candidate tokens during generation.
  • A separate Process Reward Model (PRM) is used to score candidates on both fairness and fluency, enabling debiasing through reranking/critique strategies rather than weight updates.
  • Three increasingly advanced decoding schemes are proposed (Best-of-N selection, Sequential critique-and-revise, and Constitutional self-audit); sequential debiasing achieves up to +0.40 improvement in mean bias scores while largely preserving or improving fluency.
  • The approach is extended to open-ended generation with on-the-fly token debiasing and a lightweight “Bias Guard” gate that selectively triggers to keep compute overhead near 2x for well-calibrated models.
  • Experiments on four models across an English–Urdu benchmark covering eight bias categories show that the framework scales with model capability and help pinpoint where smaller open-weight LLMs still underperform.
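The simplest of the three schemes, Best-of-N selection, can be sketched as follows. This is a minimal illustration of the control flow only: `generate_candidates` and `prm_score` are hypothetical placeholders standing in for LLM sampling and the paper's learned Process Reward Model, and the 50/50 fairness–fluency weighting is an assumption, not the paper's scoring rule.

```python
import random

def generate_candidates(prompt, n):
    """Sample N candidate continuations (placeholder for LLM sampling)."""
    return [f"{prompt} candidate-{i}" for i in range(n)]

def prm_score(candidate):
    """Score a candidate on fairness and fluency (placeholder judge).

    The real PRM is a separate model; here random scores stand in.
    Returns a combined scalar where higher is better."""
    fairness = random.random()
    fluency = random.random()
    return 0.5 * fairness + 0.5 * fluency  # assumed weighting, for illustration

def best_of_n(prompt, n=8):
    """Best-of-N: generate N candidates, keep the PRM judge's top pick.

    No model weights are touched; debiasing happens purely at decoding
    time by reranking, which is why the generator-side cost stays low."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=prm_score)
```

Because the generator only has to emit N samples it would produce anyway (e.g. via batched sampling), the extra cost lands almost entirely on the judge side, which is the intuition behind the paper's claim that Best-of-N is "effectively free on the generator side."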

Abstract

Large language models pick up social biases from the data they are trained on and carry those biases into downstream applications, often reinforcing stereotypes around gender, race, religion, disability, age, and socioeconomic status. The standard fixes (retraining on curated data or fine-tuning with human feedback) are expensive, need access to model weights, and risk degrading the model on other tasks. In this paper we take a different route: we debias the model at decoding time, treating bias mitigation as a structured search over candidate tokens without ever touching model weights. A separate Process Reward Model (PRM) acts as a judge, scoring each candidate for both fairness and fluency. We design three schemes of increasing sophistication (Best-of-N selection, Sequential critique-and-revise, and Constitutional self-audit) and evaluate them on four models (GPT-4o-mini, Llama 3.2 3B, Gemma 3 4B, Qwen 2.5 3B) across a 200-prompt bilingual benchmark in English and Urdu covering eight bias categories. Sequential debiasing proves the most effective, raising mean bias scores by up to +0.40 over baseline while preserving (and sometimes improving) fluency. We then extend all three schemes to open-ended generation, where each token is debiased on the fly, and introduce a lightweight Bias Guard gate that fires only on potentially biased words, keeping overhead near 2x for well-calibrated models. A formal overhead metric that separates generator cost from judge cost reveals that Best-of-N is effectively free on the generator side in a native implementation. GPT-4o-mini, included as a strong proprietary anchor, confirms that the framework scales with model capability; the three open-weight models show where current small-scale LLMs still struggle.
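The selective Bias Guard described above can be sketched as a gate in the token loop. Everything here is a hypothetical simplification: the cheap lexical watchlist check and the `debias_step` callback are stand-ins for whatever trigger and PRM-based revision the paper actually uses; the point is only that the expensive step runs on flagged tokens rather than on every token.

```python
def is_potentially_biased(token, watchlist):
    """Cheap gate (placeholder): flag tokens found on a watchlist.

    The real Bias Guard would be a lightweight learned trigger; a
    lexical lookup is used here purely to illustrate selective firing."""
    return token.lower().strip(".,!?") in watchlist

def generate_with_bias_guard(tokens, watchlist, debias_step):
    """Open-ended generation with a selective Bias Guard gate.

    The costly PRM-based debias_step runs only when the gate fires,
    so a well-calibrated model pays close to base decoding cost,
    consistent with the ~2x overhead figure quoted above."""
    out = []
    for tok in tokens:
        if is_potentially_biased(tok, watchlist):
            tok = debias_step(tok)  # e.g. rerank alternatives with the PRM
        out.append(tok)
    return out
```

In this framing, total overhead decomposes into a generator term (extra samples drawn) and a judge term (PRM calls on flagged tokens), matching the abstract's point that the two costs should be measured separately.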