Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees

arXiv cs.LG · April 23, 2026


Key Points

  • The paper argues that self-consistency methods for LLM inference can be compute-inefficient in domains like math and code because sampling with replacement repeatedly revisits the same prefixes and produces duplicate completions.
  • It introduces Distinct Leaf Enumeration (DLE), a deterministic decoding approach that views truncated sampling as traversing a pruned decoding tree and enumerates distinct leaves to avoid redundant sampling.
  • DLE improves efficiency by increasing coverage of the truncated search space within the same compute budget and by reusing shared prefixes to reduce unnecessary token generation.
  • Experiments show that DLE can produce higher-quality reasoning traces than stochastic self-consistency, improving performance across math, coding, and general reasoning tasks.
  • The work presents DLE as a practical alternative to sampling-based self-consistency when compute budgets are limited and diversity of completions matters.
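The duplication problem the paper identifies is easy to demonstrate with a toy example. The sketch below (illustrative only; `COMPLETIONS` and `PROBS` are made-up stand-ins for a peaked distribution over reasoning traces, not data from the paper) shows how sampling with replacement from a few dominant completions burns most of a fixed budget on repeats:

```python
import random

# Toy completion distribution: a few high-probability completions dominate,
# as is common in constrained domains like math and code.
# These values are illustrative, not taken from the paper.
COMPLETIONS = ["ans=42", "ans=41", "ans=40", "ans=39"]
PROBS = [0.70, 0.20, 0.07, 0.03]

random.seed(0)
BUDGET = 8  # number of parallel samples, as in vanilla self-consistency

# Sampling WITH replacement: duplicates are likely under a peaked distribution.
samples = random.choices(COMPLETIONS, weights=PROBS, k=BUDGET)
distinct = set(samples)
print(f"{BUDGET} samples, {len(distinct)} distinct: {samples}")
# Much of the budget re-derives the dominant completion, whereas a
# deterministic enumeration would cover all 4 distinct completions in 4 calls.
```

Since only four distinct completions exist, eight samples with replacement can never yield eight distinct traces; DLE's enumeration avoids spending budget on these collisions by construction.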

Abstract

Self-consistency boosts inference-time performance by sampling multiple reasoning traces in parallel and voting. However, in constrained domains like math and code, this strategy is compute-inefficient because it samples with replacement, repeatedly revisiting the same high-probability prefixes and duplicate completions. We propose Distinct Leaf Enumeration (DLE), a deterministic decoding method that treats truncated sampling as traversal of a pruned decoding tree and systematically enumerates distinct leaves instead of sampling with replacement. This strategy improves inference efficiency in two ways. Algorithmically, it increases coverage of the truncated search space under a fixed budget by exploring previously unvisited high-probability branches. Systemically, it reuses shared prefixes and reduces redundant token generation. Empirically, DLE explores higher-quality reasoning traces than stochastic self-consistency, yielding better performance on math, coding, and general reasoning tasks.
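The core idea of enumerating distinct leaves of a pruned decoding tree can be sketched as a best-first search over prefixes. The following is a minimal illustration, not the authors' implementation: the toy next-token model, the top-p truncation rule, the fixed completion length, and the leaf budget are all assumptions made for the sake of a runnable example.

```python
import heapq
import math

# Toy next-token model: maps a prefix (tuple of tokens) to a probability
# distribution over next tokens. Purely illustrative; DLE as described in
# the paper would use a real LLM's next-token probabilities instead.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.3, "C": 0.1},
    ("A",): {"x": 0.7, "y": 0.3},
    ("B",): {"x": 0.5, "y": 0.5},
    ("C",): {"x": 0.9, "y": 0.1},
}
MAX_LEN = 2   # leaves are completions of this length (toy stand-in for EOS)
TOP_P = 0.9   # nucleus truncation: keep the smallest token set covering this mass


def truncated_children(prefix):
    """Top-p pruning: keep highest-probability tokens until mass >= TOP_P."""
    dist = sorted(TOY_MODEL[prefix].items(), key=lambda kv: -kv[1])
    kept, mass = [], 0.0
    for tok, p in dist:
        kept.append((tok, p))
        mass += p
        if mass >= TOP_P:
            break
    return kept


def enumerate_distinct_leaves(budget):
    """Best-first enumeration of distinct leaves of the pruned decoding tree.

    A min-heap over negated log-probabilities pops the most probable
    unexpanded prefix first; each leaf is emitted exactly once, so no
    completion is ever duplicated, unlike sampling with replacement.
    """
    heap = [(0.0, ())]  # (negative log-probability, prefix)
    leaves = []
    while heap and len(leaves) < budget:
        neg_logp, prefix = heapq.heappop(heap)
        if len(prefix) == MAX_LEN:  # reached a leaf: a full completion
            leaves.append(("".join(prefix), math.exp(-neg_logp)))
            continue
        for tok, p in truncated_children(prefix):
            heapq.heappush(heap, (neg_logp - math.log(p), prefix + (tok,)))
    return leaves


print(enumerate_distinct_leaves(4))
```

Because expansion shares the priority queue across prefixes, common prefixes are visited once and extended to all surviving branches, which mirrors the paper's systems-level point about reusing shared prefixes to cut redundant token generation.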