SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding

arXiv cs.CL / 4/30/2026


Key Points

  • SpecTr-GBV is a new speculative decoding method that combines multi-draft strategies with greedy block verification into a unified framework, rather than treating them as separate improvements.
  • The paper formulates the verification step as an optimal transport problem over draft and target token blocks, aiming to improve both theoretical efficiency and practical results.
  • The authors prove that SpecTr-GBV attains the optimal expected acceptance length achievable when drafts are generated i.i.d., and that this bound improves as the number of drafts increases.
  • Experiments on five datasets against four baselines show better speedups and higher block efficiency while maintaining output quality, with ablation studies analyzing the impact of key hyperparameters.

Abstract

Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model. While existing methods either adopt multi-draft strategies to increase acceptance rates or block verification techniques to jointly verify multiple tokens, they remain limited by treating these improvements in isolation. In this work, we propose SpecTr-GBV, a novel SD method that unifies multi-draft and greedy block verification (GBV) into a single framework. By formulating the verification step as an optimal transport problem over draft and target token blocks, SpecTr-GBV improves both theoretical efficiency and empirical performance. We prove that SpecTr-GBV achieves the optimal expected acceptance length attainable within the framework of i.i.d. draft generation, and that this bound improves as the number of drafts increases. Empirically, we evaluate SpecTr-GBV across five datasets and against four baselines. Our method achieves superior speedup and significantly higher block efficiency while preserving output quality. In addition, we perform comprehensive ablation studies to evaluate the impact of key hyperparameters.
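
For readers new to the draft-then-verify loop the abstract describes, the sketch below shows the standard single-draft, token-by-token verification rule from the earlier speculative sampling literature. It is not SpecTr-GBV: the paper's multi-draft block verification replaces this per-token test with a joint decision over blocks, and its exact procedure is not given in this summary. The function name `verify_draft_block`, the argument layout, and the toy distributions are all assumptions made for illustration.

```python
import random


def verify_draft_block(draft_tokens, q_dists, p_dists, rng):
    """Baseline token-level verification of one draft block (illustrative only).

    draft_tokens: token ids proposed by the draft model.
    q_dists[i]:   draft distribution that draft_tokens[i] was sampled from.
    p_dists[i]:   target distribution at the same position; p_dists holds one
                  extra entry, used for a bonus token if every draft is accepted.
    Returns the accepted prefix plus one corrected or bonus token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the normalized residual max(p - q, 0)
        # and stop; later draft tokens are discarded.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft token was accepted: draw one bonus token from the target.
    last = p_dists[-1]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out


# Toy usage with a 3-token vocabulary (hypothetical numbers).
rng = random.Random(0)
same = [0.5, 0.3, 0.2]
tokens = verify_draft_block([0, 1, 2], [same] * 3, [same] * 4, rng)
print(len(tokens))  # draft and target agree, so all 3 drafts pass plus 1 bonus token
```

The number of tokens returned per call is the quantity that block efficiency averages over target-model calls. When the draft and target distributions agree, every draft token is accepted; the more they diverge, the earlier the loop exits, which is the loss that multi-draft and block-level verification schemes aim to reduce.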