Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo

arXiv cs.LG / 2026/4/21

Key Points

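The paper proposes a training-free, reward-guided decoding method that uses a reward-augmented target distribution to optimize sequence-level quality rather than token-level likelihood.
It combines the model's transition probabilities with prefix-dependent reward potentials, so all gains come from inference-time sampling and the model weights are left unchanged.
Sampling uses Sequential Monte Carlo (SMC). The authors develop a computationally efficient prefix-only variant and a lookahead variant whose intermediate targets match the exact marginals of the full sequence distribution.
The framework integrates resample-move updates with Metropolis-Hastings rejuvenation and supports block-wise generation. It subsumes common decoding strategies such as temperature sampling and power-tempered objectives.
In experiments on three 7B models, the method shows large gains on HumanEval and MATH500: up to 54.9% over base performance on HumanEval, with reported scores that exceed those of the reinforcement learning method GRPO.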
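
To make the "transition probabilities combined with prefix-dependent reward potentials" idea concrete, one plausible way to write such a target is shown below. The notation is an assumption for illustration (p for the base model, \psi_t for a non-negative potential, T for the sequence length, \alpha for a tempering exponent); the paper's exact parameterization may differ.

```latex
% Assumed notation, not taken from the paper:
%   p       base language model
%   \psi_t  non-negative potential evaluated on the prefix x_{1:t}
%   T       sequence length
\[
  \pi(x_{1:T}) \;\propto\; \prod_{t=1}^{T} p(x_t \mid x_{<t})\,\psi_t(x_{1:t})
\]
% If each particle proposes its next token from the model itself,
% the proposal cancels and the incremental importance weight is the potential:
\[
  w_t \;=\; w_{t-1}\,\psi_t(x_{1:t})
\]
% Choosing \psi_t(x_{1:t}) = p(x_t \mid x_{<t})^{\alpha-1} gives a power-tempered target:
\[
  \pi(x_{1:T}) \;\propto\; \prod_{t=1}^{T} p(x_t \mid x_{<t})^{\alpha} \;=\; p(x_{1:T})^{\alpha}
\]
```

Under this reading, the model weights never change: the potentials only reweight sequences that the unchanged model proposes.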

Abstract

We introduce a principled probabilistic framework for reward-guided decoding in large language models, addressing the limitations of standard decoding methods that optimize token-level likelihood rather than sequence-level quality. Our method defines a reward-augmented target distribution over complete sequences by combining model transition probabilities with prefix-dependent reward potentials. Importantly, the approach is training-free: it leaves model weights unchanged and instead modifies the inference distribution via reward potentials, with all gains arising purely from inference-time sampling. To sample from this distribution, we develop Sequential Monte Carlo algorithms, including a computationally efficient prefix-only variant and a lookahead variant whose intermediate targets match the exact marginals of the full sequence distribution. The framework also integrates resample-move updates with Metropolis-Hastings rejuvenation and supports block-wise generation, subsuming common decoding strategies such as temperature sampling and power-tempered objectives. Empirical results across three 7B models show significant gains. On code generation (HumanEval), our method improves base performance by up to 54.9% and surpasses the strongest sampling baselines by 9.1%-15.3%. On mathematical reasoning (MATH500), it achieves gains of up to 8.8%. Notably, it reaches 87.8% on HumanEval and 78.4% on MATH500 with Qwen2.5-7B, consistently outperforming the reinforcement learning method GRPO.
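
The prefix-only SMC idea can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the authors' implementation: `toy_next_token_probs` stands in for an LLM, `toy_potential` is a made-up reward potential, and the ESS-triggered multinomial resampling rule is a common default rather than something the abstract specifies. Lookahead targets, Metropolis-Hastings rejuvenation, and block-wise generation are omitted.

```python
import random
from typing import Callable, List, Sequence, Tuple

Token = int
Prefix = Tuple[Token, ...]


def toy_next_token_probs(prefix: Prefix, vocab_size: int = 4) -> List[float]:
    """Hypothetical stand-in for an LLM: uniform next-token probabilities."""
    return [1.0 / vocab_size] * vocab_size


def toy_potential(prefix: Prefix) -> float:
    """Hypothetical prefix-dependent potential: upweights prefixes ending in token 0."""
    return 2.0 if prefix[-1] == 0 else 1.0


def effective_sample_size(weights: Sequence[float]) -> float:
    """ESS = (sum w)^2 / sum w^2; equals the particle count for uniform weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def smc_decode(
    next_probs: Callable[[Prefix], Sequence[float]],
    potential: Callable[[Prefix], float],
    num_particles: int,
    num_steps: int,
    ess_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[List[Prefix], List[float]]:
    """Return particles and normalized weights after num_steps of prefix-only SMC.

    Assumes the potential is strictly positive so the weights never all vanish.
    """
    rng = random.Random(seed)
    particles: List[Prefix] = [() for _ in range(num_particles)]
    weights = [1.0] * num_particles

    for _ in range(num_steps):
        # Propose: extend every particle with a token drawn from the unchanged model.
        for i, prefix in enumerate(particles):
            probs = next_probs(prefix)
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            particles[i] = prefix + (token,)
            # Reweight: the proposal is the model, so only the potential remains.
            weights[i] *= potential(particles[i])

        # Resample (multinomial) when the weights degenerate, then reset them.
        if effective_sample_size(weights) < ess_fraction * num_particles:
            particles = rng.choices(particles, weights=weights, k=num_particles)
            weights = [1.0] * num_particles

    total = sum(weights)
    return particles, [w / total for w in weights]


if __name__ == "__main__":
    final_particles, final_weights = smc_decode(
        toy_next_token_probs, toy_potential, num_particles=500, num_steps=8
    )
    best = max(zip(final_weights, final_particles))[1]
    print("highest-weight particle:", best)
```

In this toy setting the base model picks token 0 a quarter of the time, while the reward-weighted particles pick it noticeably more often, which is the intended effect of steering by potentials without touching the model. With a real LLM, `next_probs` would wrap a forward pass and `potential` would wrap a reward model or verifier; both names are placeholders.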
