Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
arXiv cs.LG / 4/21/2026
Key Points
- The paper proposes a training-free, reward-guided decoding framework that optimizes sequence-level quality rather than token-level likelihood by defining a reward-augmented target distribution over full sequences.
- It constructs this distribution from the model's transition probabilities combined with prefix-dependent reward potentials, enabling inference-time sampling without changing model weights (one plausible formulation is sketched after this list).
- The authors develop Sequential Monte Carlo (SMC) sampling methods, including a computationally efficient prefix-only variant and a lookahead variant that matches the exact marginals of the full sequence distribution (a minimal prefix-only loop is sketched below).
- The framework supports resample-move updates with Metropolis-Hastings rejuvenation and block-wise generation, and it generalizes common decoding approaches such as temperature sampling and power-tempered objectives (see the special case noted below).
- Experiments on three 7B models show substantial improvements on HumanEval and MATH500, including gains of up to +54.9% over the base model on HumanEval and scores that outperform the RL method GRPO on the reported benchmarks.
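
One plausible way to write the reward-augmented target described above is the following; the symbols here ($p_\theta$ for the frozen base model, $r$ for a sequence-level reward, $\beta$ for an inverse-temperature-like scale, and the prefix potentials $\phi_t$) are our notation for illustration, not necessarily the paper's exact definitions:

```latex
% Reward-augmented sequence-level target (sketch, notation ours):
\pi(x_{1:T}) \;\propto\; p_\theta(x_{1:T})\,\exp\!\bigl(r(x_{1:T})/\beta\bigr)
% Intermediate SMC targets via prefix-dependent potentials \phi_t,
% with \phi_T(x_{1:T}) = \exp(r(x_{1:T})/\beta):
\pi_t(x_{1:t}) \;\propto\; p_\theta(x_{1:t})\,\phi_t(x_{1:t})
% Proposing each token from the base model cancels the p_\theta terms,
% leaving the incremental importance weight
w_t \;=\; \frac{\phi_t(x_{1:t})}{\phi_{t-1}(x_{1:t-1})}
```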
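A minimal sketch of a prefix-only SMC decoding loop under those assumptions, with toy stand-ins for the language model and the reward potential (all names here, including `next_token_probs` and `prefix_potential`, and all constants are hypothetical and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS, MAX_LEN, N_PARTICLES, BETA = 8, 0, 20, 16, 0.5

def next_token_probs(prefix):
    """Stand-in for the base LM p_theta(x_t | x_<t): a fixed toy categorical.
    In practice this would be a forward pass of the frozen model."""
    logits = np.sin(np.arange(VOCAB) + len(prefix))  # arbitrary but deterministic
    p = np.exp(logits - logits.max())
    return p / p.sum()

def prefix_potential(prefix):
    """Stand-in for the prefix-dependent reward potential phi_t(x_{1:t}):
    here exp(reward / BETA) with a toy reward favoring large token ids."""
    return float(np.exp(np.mean(prefix) / BETA)) if prefix else 1.0

def smc_decode():
    particles = [[] for _ in range(N_PARTICLES)]
    logw = np.zeros(N_PARTICLES)
    for _ in range(MAX_LEN):
        for i, x in enumerate(particles):
            if x and x[-1] == EOS:
                continue  # finished particles keep their weight
            p = next_token_probs(x)
            tok = rng.choice(VOCAB, p=p)  # propose from the base model
            old_phi = prefix_potential(x)
            x.append(int(tok))
            # incremental weight phi_t / phi_{t-1}; the model terms cancel
            logw[i] += np.log(prefix_potential(x)) - np.log(old_phi)
        # resample when the effective sample size collapses
        w = np.exp(logw - logw.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < N_PARTICLES / 2:
            idx = rng.choice(N_PARTICLES, size=N_PARTICLES, p=w)
            particles = [list(particles[j]) for j in idx]
            logw[:] = 0.0
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return particles[int(np.argmax(w))]

print(smc_decode())
```

Resampling only when the effective sample size drops below half the particle count keeps the particle set diverse between resample steps; the paper's resample-move variant would additionally apply Metropolis-Hastings rejuvenation moves after each resample.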
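The generalization claim can be illustrated, for example, by choosing the potential to be a power of the model's own prefix probability (again our notation, assuming the framework sketched above):

```latex
% Choosing \phi_t(x_{1:t}) = p_\theta(x_{1:t})^{1/\tau - 1} gives
\pi_t(x_{1:t}) \;\propto\; p_\theta(x_{1:t})^{1/\tau}
% i.e. a power-tempered sequence distribution; \tau = 1 recovers
% ordinary ancestral sampling from the base model.
```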