Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
arXiv cs.LG / 4/21/2026
Key Points
- The paper proposes a training-free, reward-guided decoding framework that optimizes sequence-level quality rather than token-level likelihood by defining a reward-augmented target distribution over full sequences.
- It constructs this distribution from the model's transition probabilities combined with prefix-dependent reward potentials, enabling inference-time sampling without changing model weights (one plausible formulation is sketched after this list).
- The authors develop Sequential Monte Carlo (SMC) sampling methods, including a computationally efficient prefix-only variant and a lookahead variant that matches the exact marginals of the full sequence distribution (a minimal prefix-only loop is sketched below).
- The framework supports resample-move updates with Metropolis-Hastings rejuvenation and block-wise generation, and it generalizes common decoding approaches such as temperature sampling and power-tempered objectives (see the special case noted below).
- Experiments on three 7B models show substantial improvements on HumanEval and MATH500, including gains of up to +54.9% over the base model on HumanEval and scores that outperform the RL method GRPO on the reported benchmarks.
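
One plausible way to write the reward-augmented target described above is the following; the symbols here ($p_\theta$ for the frozen base model, $r$ for a sequence-level reward, $\beta$ for an inverse-temperature-like scale, and the prefix potentials $\phi_t$) are our notation for illustration, not necessarily the paper's exact definitions:

```latex
% Reward-augmented sequence-level target (sketch, notation ours):
\pi(x_{1:T}) \;\propto\; p_\theta(x_{1:T})\,\exp\!\bigl(r(x_{1:T})/\beta\bigr)
% Intermediate SMC targets via prefix-dependent potentials \phi_t,
% with \phi_T(x_{1:T}) = \exp(r(x_{1:T})/\beta):
\pi_t(x_{1:t}) \;\propto\; p_\theta(x_{1:t})\,\phi_t(x_{1:t})
% Proposing each token from the base model cancels the p_\theta terms,
% leaving the incremental importance weight
w_t \;=\; \frac{\phi_t(x_{1:t})}{\phi_{t-1}(x_{1:t-1})}
```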
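A minimal sketch of a prefix-only SMC decoding loop under those assumptions, with toy stand-ins for the language model and the reward potential (all names here, including `next_token_probs` and `prefix_potential`, and all constants are hypothetical and not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS, MAX_LEN, N_PARTICLES, BETA = 8, 0, 20, 16, 0.5

def next_token_probs(prefix):
    """Stand-in for the base LM p_theta(x_t | x_<t): a fixed toy categorical.
    In practice this would be a forward pass of the frozen model."""
    logits = np.sin(np.arange(VOCAB) + len(prefix))  # arbitrary but deterministic
    p = np.exp(logits - logits.max())
    return p / p.sum()

def prefix_potential(prefix):
    """Stand-in for the prefix-dependent reward potential phi_t(x_{1:t}):
    here exp(reward / BETA) with a toy reward favoring large token ids."""
    return float(np.exp(np.mean(prefix) / BETA)) if prefix else 1.0

def smc_decode():
    particles = [[] for _ in range(N_PARTICLES)]
    logw = np.zeros(N_PARTICLES)
    for _ in range(MAX_LEN):
        for i, x in enumerate(particles):
            if x and x[-1] == EOS:
                continue  # finished particles keep their weight
            p = next_token_probs(x)
            tok = rng.choice(VOCAB, p=p)  # propose from the base model
            old_phi = prefix_potential(x)
            x.append(int(tok))
            # incremental weight phi_t / phi_{t-1}; the model terms cancel
            logw[i] += np.log(prefix_potential(x)) - np.log(old_phi)
        # resample when the effective sample size collapses
        w = np.exp(logw - logw.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < N_PARTICLES / 2:
            idx = rng.choice(N_PARTICLES, size=N_PARTICLES, p=w)
            particles = [list(particles[j]) for j in idx]
            logw[:] = 0.0
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return particles[int(np.argmax(w))]

print(smc_decode())
```

Resampling only when the effective sample size drops below half the particle count keeps the particle set diverse between resample steps; the paper's resample-move variant would additionally apply Metropolis-Hastings rejuvenation moves after each resample.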
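The generalization claim can be illustrated, for example, by choosing the potential to be a power of the model's own prefix probability (again our notation, assuming the framework sketched above):

```latex
% Choosing \phi_t(x_{1:t}) = p_\theta(x_{1:t})^{1/\tau - 1} gives
\pi_t(x_{1:t}) \;\propto\; p_\theta(x_{1:t})^{1/\tau}
% i.e. a power-tempered sequence distribution; \tau = 1 recovers
% ordinary ancestral sampling from the base model.
```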