Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
arXiv cs.CL / 5/5/2026
Key Points
- The paper introduces “decoding-time debiasing” that mitigates social bias in large language models without retraining or fine-tuning, by searching over candidate tokens during generation.
- A separate Process Reward Model (PRM) is used to score candidates on both fairness and fluency, enabling debiasing through reranking/critique strategies rather than weight updates.
- Three increasingly advanced decoding schemes are proposed: Best-of-N selection, sequential critique-and-revise, and constitutional self-audit. The sequential scheme achieves up to a +0.40 improvement in mean bias scores while largely preserving or improving fluency.
- The approach is extended to open-ended generation with on-the-fly token debiasing and a lightweight “Bias Guard” gate that selectively triggers to keep compute overhead near 2x for well-calibrated models.
- Experiments on four models across an English–Urdu benchmark covering eight bias categories show that the framework scales with model capability, and help pinpoint where smaller open-weight LLMs still underperform.
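To make the reranking idea concrete, here is a minimal sketch of PRM-guided Best-of-N selection. All names (`generate_candidates`, `prm_score`, the `alpha` weighting between fairness and fluency) are illustrative assumptions, not the paper's actual API: the paper's PRM scores candidate continuations on fairness and fluency, and the decoder keeps the highest-scoring one instead of updating model weights.

```python
def best_of_n_debias(prompt, generate_candidates, prm_score, n=8, alpha=0.5):
    """Sample n continuations and keep the one with the best
    combined fairness/fluency score from a Process Reward Model.

    generate_candidates(prompt, n) -> list of candidate strings (assumed)
    prm_score(prompt, cand)        -> (fairness, fluency) floats (assumed)
    alpha                          -> hypothetical fairness/fluency trade-off
    """
    candidates = generate_candidates(prompt, n)

    def combined(cand):
        fairness, fluency = prm_score(prompt, cand)
        return alpha * fairness + (1 - alpha) * fluency

    # Rerank: no weight updates, just pick the top-scoring candidate.
    return max(candidates, key=combined)


if __name__ == "__main__":
    # Toy stubs standing in for a real sampler and a real PRM.
    cands = ["a", "bb", "ccc"]
    gen = lambda prompt, n: cands[:n]
    prm = lambda prompt, c: (float(len(c)), 1.0)  # stub: fairness grows with length
    print(best_of_n_debias("prompt", gen, prm, n=3))  # -> "ccc"
```

The same skeleton extends naturally to the sequential critique-and-revise scheme by looping generate-score-revise until the PRM score stops improving.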