$S^3$: Stratified Scaling Search for Test-Time Scaling in Diffusion Language Models

arXiv cs.LG / 4/9/2026


Key Points

  • The paper introduces S^3 (Stratified Scaling Search) as a test-time scaling method for diffusion language models that improves output quality by reallocating inference compute during denoising rather than only using best-of-K at the end.
  • It expands multiple candidate denoising trajectories at each step, scores them with a lightweight reference-free verifier, and selectively resamples promising candidates while maintaining diversity in the search frontier.
  • The method approximates a reward-tilted sampling distribution that increases the likelihood of higher-quality outputs while staying anchored to the original model prior.
  • Experiments on LLaDA-8B-Instruct across MATH-500, GSM8K, ARC-Challenge, and TruthfulQA show consistent gains, with the largest improvements on mathematical reasoning tasks, without changing the underlying model or the decoding schedule.
  • The results suggest classical verifier-guided search over denoising trajectories is a practical mechanism for test-time scaling in diffusion language models.
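The expand/verify/resample loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise_step` and `verifier_score` are hypothetical stand-ins for the model's reverse-diffusion update and the lightweight reference-free verifier, and the specific resampling rule (keep the best candidate, fill the rest by softmax-weighted sampling) is one plausible way to trade off exploitation against frontier diversity.

```python
import math
import random

def denoise_step(state, rng):
    # Placeholder: one stochastic denoising update of a candidate trajectory.
    return state + [rng.random()]

def verifier_score(state):
    # Placeholder: lightweight reference-free quality score.
    return sum(state)

def s3_search(num_steps=8, frontier_size=4, expand=3, temperature=1.0, seed=0):
    rng = random.Random(seed)
    frontier = [[] for _ in range(frontier_size)]
    for _ in range(num_steps):
        # 1) Expand: several candidate denoising continuations per trajectory.
        candidates = [denoise_step(s, rng) for s in frontier for _ in range(expand)]
        # 2) Score with the verifier; softmax weights approximate
        #    reward-tilted resampling.
        scores = [verifier_score(c) for c in candidates]
        m = max(scores)
        weights = [math.exp((s - m) / temperature) for s in scores]
        # 3) Resample: keep the top candidate, fill the rest by weighted
        #    sampling so the frontier stays diverse.
        best = candidates[scores.index(m)]
        rest = rng.choices(candidates, weights=weights, k=frontier_size - 1)
        frontier = [best] + rest
    # Return the highest-scoring completed trajectory.
    return max(frontier, key=verifier_score)
```

With the placeholder model above, each trajectory grows by one token-like value per step, so `s3_search()` returns a list of length `num_steps`; in a real DLM the state would instead be a partially unmasked token sequence.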

Abstract

Test-time scaling investigates whether a fixed diffusion language model (DLM) can generate better outputs when given more inference compute, without additional training. However, naive best-of-K sampling is fundamentally limited because it repeatedly draws from the same base diffusion distribution, whose high-probability regions are often misaligned with high-quality outputs. We propose S^3 (Stratified Scaling Search), a classical verifier-guided search method that improves generation by reallocating compute during the denoising process rather than only at the final output stage. At each denoising step, S^3 expands multiple candidate trajectories, evaluates them with a lightweight reference-free verifier, and selectively resamples promising candidates while preserving diversity within the search frontier. This procedure effectively approximates a reward-tilted sampling distribution that favors higher-quality outputs while remaining anchored to the model prior. Experiments with LLaDA-8B-Instruct on MATH-500, GSM8K, ARC-Challenge, and TruthfulQA demonstrate that S^3 consistently improves performance across benchmarks, achieving the largest gains on mathematical reasoning tasks while leaving the underlying model and decoding schedule unchanged. These results show that classical search over denoising trajectories provides a practical mechanism for test-time scaling in DLMs.
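The reward-tilted target the abstract alludes to is not spelled out in this summary, but under the standard verifier-guidance formulation it would take a form like:

```latex
\tilde{p}(x) \;\propto\; p_\theta(x)\,\exp\!\big(r(x)/\tau\big)
```

where $p_\theta$ is the base diffusion model's distribution, $r(x)$ is the verifier's score, and $\tau$ is a temperature controlling how strongly sampling is tilted toward high-reward outputs. Selectively resampling high-scoring trajectories during denoising approximates draws from $\tilde{p}$ while remaining anchored to the prior $p_\theta$; the exact tilting used in the paper may differ.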