The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

arXiv cs.AI / 5/5/2026

📰 News · Models & Research

Key Points

  • The paper addresses a key challenge in “reasoning without training”: base LLMs already assign some probability mass to correct multi-step solutions, but inference-time decoding must locate those probability modes efficiently.
  • It proposes Auxiliary Particle Power Sampling (APPS), which approximates a sequence-level power target proportional to p_theta(x)^alpha (with alpha > 1) using a bounded number of particles in a blockwise, parallel way.
  • APPS uses proposal-corrected power reweighting and future-value-guided selection at resampling boundaries to allocate compute across competing prefixes instead of committing to one decoding path.
  • The method includes practical future-value estimates via short-horizon rollouts and an amortized variant that uses a lightweight learned selection head to reduce overhead.
  • Experiments on reasoning benchmarks show that APPS improves the accuracy–runtime trade-off for training-free decoding and indicate that more of the gap to post-trained systems can be recovered via better inference-time power approximation.
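
The reweighting and resampling steps in the bullets above can be sketched with a toy particle loop. This is illustrative only and not code from the paper: `logp_fn` stands in for the base model's next-token log-probabilities, and `sample_block`, `power_weights`, and `resample` are hypothetical helpers. When the model itself is the proposal, the importance weight for a sampled block under the power target p_theta(x)^alpha reduces to p_theta(block)^(alpha - 1).

```python
import math
import random

random.seed(0)

def sample_block(logp_fn, prefix, block_len, vocab):
    """Sample one block of tokens from the base model (the proposal),
    accumulating the block's log-probability under the model."""
    block, logp = [], 0.0
    for _ in range(block_len):
        probs = logp_fn(prefix + block)  # {token: log-prob} for next token
        tok = random.choices(vocab, weights=[math.exp(probs[t]) for t in vocab])[0]
        block.append(tok)
        logp += probs[tok]
    return block, logp

def power_weights(block_logps, alpha):
    """Proposal-corrected weights for target p^alpha with proposal p:
    w_i ∝ p(block_i)^alpha / p(block_i) = exp((alpha - 1) * log p(block_i))."""
    raw = [(alpha - 1.0) * lp for lp in block_logps]
    m = max(raw)                           # subtract max for numerical stability
    w = [math.exp(r - m) for r in raw]
    z = sum(w)
    return [x / z for x in w]

def resample(particles, weights, rng=random):
    """Multinomial resampling: duplicate high-weight prefixes, drop low ones,
    keeping the particle population (and peak memory) bounded."""
    idx = rng.choices(range(len(particles)), weights=weights, k=len(particles))
    return [particles[i][:] for i in idx]

# Toy demo: a fixed model over vocab {0, 1} that prefers token 1.
vocab = [0, 1]
def toy_logp(seq):
    return {0: math.log(0.2), 1: math.log(0.8)}

particles = [[] for _ in range(4)]
results = [sample_block(toy_logp, p, 3, vocab) for p in particles]
for p, (blk, _) in zip(particles, results):
    p.extend(blk)
weights = power_weights([lp for _, lp in results], alpha=2.0)
particles = resample(particles, weights)
```

With alpha = 2, the log-probability gap between competing blocks is doubled before normalization, so higher-probability prefixes survive resampling proportionally more often, which is the mode-seeking bias the power target is meant to induce.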

Abstract

A recurring pattern in "reasoning without training" is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. Power sampling provides a principled way to bias decoding toward such modes by targeting p_theta(x)^alpha with alpha > 1, but practical approximations must account for future-dependent correction factors that determine which prefixes remain promising. We introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for approximating the sequence-level power target with a bounded population of partial solutions. APPS propagates hypotheses in parallel using proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries. This redistributes finite compute across competing prefixes rather than committing to a single unfolding path, while providing a direct scaling knob in the particle count and predictable peak memory. We instantiate the future-value signal with short-horizon rollouts and also study an amortized variant that replaces rollouts with a lightweight learned selection head. Across reasoning benchmarks, APPS improves the accuracy-runtime trade-off of training-free decoding and suggests that part of the gap to post-trained systems can be recovered through more faithful inference-time power approximation.
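
The short-horizon rollout signal described in the abstract can be illustrated with a small Monte Carlo estimator. This is a sketch under assumed interfaces, not the paper's implementation: `logp_fn`, `rollout_value`, and `selection_probs` are hypothetical names. For each surviving prefix, the model is rolled forward a few tokens several times and the power-tilted continuation weights are averaged; at a resampling boundary, the selection probability combines the current particle weight with this look-ahead value, in the spirit of auxiliary particle filtering.

```python
import math
import random

def rollout_value(logp_fn, prefix, horizon, n_rollouts, vocab, alpha, rng=random):
    """Monte Carlo look-ahead: estimate E[p_theta(continuation)^(alpha - 1)]
    for a prefix by averaging power-tilted weights of short rollouts.
    `logp_fn` maps a token sequence to {token: log-prob} (hypothetical)."""
    total = 0.0
    for _ in range(n_rollouts):
        seq, lp = list(prefix), 0.0
        for _ in range(horizon):
            probs = logp_fn(seq)
            tok = rng.choices(vocab, weights=[math.exp(probs[t]) for t in vocab])[0]
            seq.append(tok)
            lp += probs[tok]
        total += math.exp((alpha - 1.0) * lp)
    return total / n_rollouts

def selection_probs(weights, values):
    """Auxiliary-particle-style selection at a resampling boundary:
    survival probability proportional to current weight times look-ahead value."""
    raw = [w * v for w, v in zip(weights, values)]
    z = sum(raw)
    return [r / z for r in raw]
```

The amortized variant mentioned in the paper would replace `rollout_value` with a learned selection head, trading the rollouts' extra forward passes for a single cheap prediction per prefix.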