Tiny Inference-Time Scaling with Latent Verifiers

arXiv cs.CV / 3/25/2026


Key Points

  • The paper introduces Verifier on Hidden States (VHS), an inference-time verifier for diffusion transformer (DiT) generators that evaluates intermediate hidden representations instead of decoding candidates to pixel space.
  • By avoiding redundant pixel-space decoding and re-encoding into multimodal embedding spaces, VHS substantially lowers the per-candidate verification cost compared with MLLM-based verifiers.
  • Experiments under tiny inference budgets show VHS matches or improves on MLLM verifier performance while reducing joint generation-and-verification time by 63.3%, FLOPs by 51%, and VRAM usage by 14.5%.
  • At the same inference-time budget, VHS achieves a +2.7% improvement on GenEval, suggesting efficient test-time scaling can be achieved without heavy multimodal verifier overhead.
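The selection loop behind these results is standard best-of-N inference-time scaling: generate several candidates per prompt, score each with a verifier, and keep the highest-scoring one. The sketch below illustrates the control flow with toy modules; `ToyDiTGenerator`, `ToyVHS`, and `best_of_n` are hypothetical stand-ins (the paper's actual architectures are not reproduced here), and the key point is only that the verifier reads the generator's intermediate hidden states rather than decoded pixels.

```python
import torch

class ToyDiTGenerator(torch.nn.Module):
    """Toy single-step generator returning a latent and its hidden features."""
    def __init__(self, latent_dim=16, hidden_dim=32):
        super().__init__()
        self.proj = torch.nn.Linear(latent_dim, hidden_dim)

    def forward(self, noise):
        hidden = torch.relu(self.proj(noise))  # intermediate DiT-like features
        latent = 0.9 * noise                   # stand-in for one-step denoising
        return latent, hidden

class ToyVHS(torch.nn.Module):
    """Toy verifier that scores hidden states directly (no pixel decoding)."""
    def __init__(self, hidden_dim=32):
        super().__init__()
        self.head = torch.nn.Linear(hidden_dim, 1)

    def forward(self, hidden):
        # Pool over the sequence dimension, then emit one score per candidate.
        return self.head(hidden.mean(dim=-2)).squeeze(-1)

def best_of_n(generator, verifier, n_candidates=4, seq_len=8, latent_dim=16):
    """Generate N candidates, verify on hidden states, return the best latent."""
    noises = torch.randn(n_candidates, seq_len, latent_dim)
    latents, hiddens = generator(noises)
    scores = verifier(hiddens)          # verification never leaves latent space
    best = scores.argmax().item()
    return latents[best], scores
```

Only the winning candidate would ever need to be decoded to pixel space, which is where the per-candidate savings come from.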

Abstract

Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduces substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving on or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling, reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% with respect to a standard MLLM verifier, while achieving a +2.7% improvement on GenEval at the same inference-time budget.
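The cost argument in the abstract can be made concrete by counting the operations each verification path performs per candidate. The sketch below uses hypothetical stub functions (`vae_decode`, `vision_encode`, etc., not real library calls) purely to show that the MLLM path incurs a decode and a re-encode per candidate that the hidden-state path skips entirely.

```python
# Call counters stand in for the real FLOP/latency cost of each stage.
calls = {"decode": 0, "encode": 0, "score": 0}

def vae_decode(latent):
    """Stand-in for latent -> pixel decoding (expensive)."""
    calls["decode"] += 1
    return [x * 2.0 for x in latent]

def vision_encode(pixels):
    """Stand-in for pixel -> visual-embedding re-encoding (expensive)."""
    calls["encode"] += 1
    return [x / 2.0 for x in pixels]

def score(features):
    """Stand-in for the verifier's scoring head (cheap in both paths)."""
    calls["score"] += 1
    return sum(features)

def mllm_verify(latent):
    # MLLM path: latent -> pixels -> embeddings -> score.
    return score(vision_encode(vae_decode(latent)))

def vhs_verify(hidden):
    # VHS path: score generator hidden states directly.
    return score(hidden)

# Verifying 4 candidates with each path:
candidates = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2], [0.0, 0.4]]
for c in candidates:
    mllm_verify(c)   # adds 1 decode + 1 encode + 1 score each
for c in candidates:
    vhs_verify(c)    # adds only 1 score each
```

After the loops, the MLLM path has accumulated 4 decodes and 4 encodes that the VHS path avoided, which is the structural source of the reported time, FLOPs, and VRAM savings.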