Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
SD uses a lightweight draft model to speculate multiple future tokens, which are then verified in parallel by the target model. This way, SD can significantly improve throughput while preserving the exact output distribution of the target model.
Despite rapid progress in SD algorithms, their evaluation remains fragmented and often unrepresentative of real-world data and serving conditions.
In practice, SD speculation quality and inference speedups are inherently data-dependent, serving-regime–dependent, and system-dependent.
Yet most existing benchmarks rely on small prompt sets, limited semantic diversity, short input sequence lengths, batch size one, or high-level inference stacks that do not reflect production environments.
To address these gaps, we introduce SPEED-Bench: a unified benchmark designed to evaluate SD across diverse semantic domains and realistic serving regimes, using production-grade inference engines.
What is SPEED-Bench?
SD must be evaluated from two perspectives.
On one hand, draft quality depends on the semantic domain and entropy of the input text.
On the other hand, real-world speedups depend on batch size, input sequence length (ISL), and system constraints, which determine whether inference is memory-bound or compute-bound.
SPEED-Bench therefore introduces a benchmarking ecosystem for SD.
It combines two purpose-built dataset splits and a unified measurement framework, each designed to capture a different aspect of SD behavior:
- A “Qualitative” data split, optimized for semantic diversity and designed to measure speculation quality (drafter accuracy) across domains.
- A “Throughput” data split, constructed to evaluate system-level speedups across various input sequence lengths and high concurrency.
- A unified measurement framework, integrated with production inference engines, that standardizes evaluation across systems.
Together, these components enable practitioners and researchers to analyze SD behavior that is often masked by existing benchmarks.
Figure 1 provides a high-level overview of the SPEED-Bench ecosystem.
The Qualitative split: semantic coverage and draft accuracy
The goal of the Qualitative split is to measure speculative decoding quality, specifically conditional acceptance rates (ARs) and acceptance lengths (ALs), across a wide range of semantic domains.
SpecBench introduced the first unified SD benchmark across diverse application scenarios, such as multi-turn conversation, translation, and mathematical reasoning, by aggregating instances from widely used datasets into a unified testing environment. However, despite being a significant step toward standardized evaluations, it has critical limitations regarding scale and diversity. Most categories contain as few as 10 samples with short mean input lengths (< 100 tokens) that may fail to stress modern drafters. Additionally, some of its categories often lack structural diversity, such as the multilingual category consisting entirely of German-to-English translation prompts.
While extensive evaluation across numerous datasets is theoretically possible, it is tedious, impractical for rapid experimentation, and hinders direct comparisons between different research groups releasing SD algorithms and models. Instead of relying on exhaustive evaluations across disparate datasets, we curate a compact yet highly representative subset designed to maximize semantic diversity.
We aggregate data from 18 publicly available sources and organize it into 11 categories, including Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, and QA.
Each category contains 80 samples, resulting in a total of 880 prompts.
Unlike prior benchmarks, which often suffer from low intra-category diversity, the SPEED-Bench Qualitative split explicitly prioritizes semantic diversity.
To achieve this, each candidate prompt is embedded into a dense vector space using a pretrained text embedder (openai/text-embedding-3-small).
We then apply a selection algorithm that minimizes average pairwise cosine similarity within each category.
This ensures that the selected samples span the semantic space as widely as possible, reducing redundancy and increasing evaluation fidelity.
The effectiveness of this approach is shown in Figure 2, which compares average semantic similarity across SPEED-Bench and SpecBench.
This semantic diversity is critical for exposing domain-dependent behavior in SD, such as the strong contrast between low-entropy domains (e.g., Coding, Math) and high-entropy domains (e.g., Roleplay, Writing).
The Throughput split: realistic serving workloads
While the Qualitative split captures draft accuracy, it is insufficient for evaluating system-level speedups.
We evaluate system-level speedups using two metrics: Throughput (Output TPS), the total tokens generated per second across all concurrent requests, and User TPS, the per-request token generation rate. User TPS acts as a proxy for end-user latency.
In production environments, models are served under high concurrency and a wide range of input sequence lengths, which are often much longer than the short ISL samples used in many SD benchmarks.
As batch size increases, inference often transitions from a compute-bound regime to a memory-bound regime, fundamentally changing the cost-benefit trade-offs of speculative decoding.
The Throughput split is designed specifically to capture this behavior.
We construct fixed ISL buckets ranging from 1k to 32k tokens, reflecting the growing importance of long-context applications such as coding assistants and retrieval-augmented generation.
For each ISL bucket, prompts are aggregated into three coarse difficulty categories corresponding to low-, mixed-, and high-entropy domains.
Each ISL bucket contains 1,536 prompts (512 per difficulty category), providing sufficient volume to construct stable throughput Pareto curves across a wide range of batch sizes (Figure 3).
To ensure deterministic prefill cost, prompts are either truncated or padded in a controlled manner, while preserving their semantic content.
Importantly, SPEED-Bench avoids the use of random token inputs for throughput benchmarking.
As we show later, random tokens can severely distort acceptance behavior, expert routing in MoE models, and throughput measurements, leading to overly optimistic conclusions.
A unified measurement framework
Benchmarking SD across inference engines presents a subtle but critical challenge.
Different engines may apply different chat templates, handle BOS tokens differently, or tokenize inputs inconsistently.
These differences can silently alter the drafted sequence, making cross-engine comparisons unreliable.
SPEED-Bench introduces a lightweight measurement framework that handles tokenization and prompt formatting externally.
Inference engines receive pre-tokenized sequences, ensuring that all systems process identical inputs.
The framework integrates with production-grade engines: TensorRT-LLM, vLLM, and SGLang.
It captures fine-grained timing information from streaming responses to compute acceptance behavior, step latency, user-level tokens-per-second, and overall throughput.
Below is an example of running our measurement framework on Llama 3.3 70B Instruct as the target model with EAGLE3 as the draft model on the Qualitative split of SPEED-Bench, using TensorRT-LLM with a batch size of 32 (8*H100 GPUs):
Example output of the measurement framework
bash-5.2$ mpirun -n 1 --oversubscribe python3 run.py --model_dir meta-llama/Llama-3.3-70B-Instruct --tokenizer meta-llama/Llama-3.3-70B-Instruct --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B --dataset speed --dataset_path data/speed/qualitative --tp_size 8 --ep_size 1 --draft_length 3 --output_length 4096 --engine TRTLLM --concurrency 32 --show_progress
...
[TensorRT-LLM] TensorRT LLM version: 1.2.0rc1
...
Running requests (concurrency=32): 100%|██████████| 880/880 [02:58<00:00, 4.93it/s]
...
Acceptance Length Histogram
{1: 57385, 2: 36968, 3: 24441, 4: 61182}
Conditional acceptance rate
1 1.0
2 0.681151931368627
3 0.6984444208791836
4 0.7145509968116043
Acceptance Rate Results
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Category ┃ Average AR ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ coding │ 3.0001 │
│ humanities │ 2.3699 │
│ math │ 2.4710 │
│ multilingual │ 1.7277 │
│ qa │ 2.3184 │
│ rag │ 2.7502 │
│ reasoning │ 2.6142 │
│ roleplay │ 2.0407 │
│ stem │ 2.4306 │
│ summarization │ 2.6026 │
│ writing │ 2.6364 │
├─────────────────┼────────────┤
│ Overall Average │ 2.4511 │
└─────────────────┴────────────┘
Output TPS 2518.1464055829188
Output TPS/gpu 314.76830069786485
E2E Request Time {'min': '0.1031', 'max': '51.3442', 'mean': '4.7313', 'std': '4.8872', 'quantiles': {'0.25': '1.6972', '0.5': '3.5459', '0.75': '6.1715'}}
TTFT Time {'min': '0.0528', 'max': '0.8010', 'mean': '0.1217', 'std': '0.1133', 'quantiles': {'0.25': '0.0841', '0.5': '0.0982', '0.75': '0.1139'}}
Request Generation Step Time {'min': '0.0000', 'max': '0.8010', 'mean': '0.0299', 'std': '0.0173', 'quantiles': {'0.25': '0.0259', '0.5': '0.0267', '0.75': '0.0280'}}
Request Generation Tokens Per Second {'min': '35.1116', 'max': '142.9547', 'mean': '85.3162', 'std': '17.4823', 'quantiles': {'0.25': '75.0608', '0.5': '84.9756', '0.75': '97.0815'}}
Number of Output Tokens {'min': '3.0000', 'max': '4096.0000', 'mean': '394.8787', 'std': '452.7451', 'quantiles': {'0.25': '123.0000', '0.5': '281.0000', '0.75': '518.2500'}}
This design isolates the effects of SD algorithms and system optimizations from preprocessing artifacts.
Insights from SPEED-Bench
Domain-dependent accuracy and speedups
The table below reports average acceptance lengths and speedups across domains and models at a realistic batch size (32) and a draft length of 3.
The results confirm that SD acceptance length is highly domain-dependent.
Low-entropy domains such as Coding and Math consistently yield higher acceptance lengths, while high-entropy tasks such as Roleplay and Writing are more difficult to speculate on.
The table also highlights differences between speculation methods.
Lightweight approaches such as N-Gram speculation can result in net slowdowns at moderate batch sizes. We further see that native MTP heads achieve significantly larger ALs than post-trained alternatives like EAGLE3, highlighting the benefit of co-training the base model and drafter from scratch.
| Domain | Llama 3.3 70B with N-Gram (TensorRT-LLM) | GPT OSS 120B with EAGLE3 (TensorRT-LLM) | Qwen3-Next with MTP (SGLang) |
|---|---|---|---|
| Coding | 1.54 | 2.46 | 3.34 |
| Math | 1.43 | 2.46 | 3.13 |
| Roleplay | 1.15 | 1.87 | 2.09 |
| Writing | 1.33 | 1.98 | 2.46 |
| Mean AL | 1.41 | 2.25 | 2.81 |
| Mean Speedup | 0.88x | 1.34x | 1.20x |
Full results on the Qualitative split of all categories and different models are in our paper.
Vocabulary pruning reveals long-tail failures
SPEED-Bench can also assist with exposing side effects of aggressive system optimizations.
Vocabulary pruning is used in EAGLE3 to reduce the computational cost of the final projection layer.
While effective on narrow domains, this optimization can degrade acceptance length on the “long tail” of user inputs.
Figure 4 shows acceptance length across domains when using full vs. pruned vocabularies for GPT-OSS 120B with EAGLE3.
The impact is minimal in Coding and Math, but substantial in Multilingual, RAG, and Summarization categories.
These effects are largely invisible in low-diversity benchmarks, underscoring the importance of broad semantic coverage in the evaluation data.
Random tokens overestimate throughput
A common practice in inference benchmarking is to use random tokens to simulate prompt load.
While may be sufficient for autoregressive decoding, this approach is fundamentally flawed for SD algorithms and even for mixture of experts models (MoE) without speculation, as presented below.
Random tokens trigger two failure modes that skew measurements:
- Trivial Response: The model identifies noise and defaults to predictable acknowledgments, artificially inflating ALs.
Example output (Base: GPT-OSS 120b, Drafter: EAGLE3, Draft Length:3, Average AL: 3.44):
It looks like you’ve pasted a very long block of mixed‑language text that doesn’t form a clear question or request. I’m happy to help, but I need a bit more guidance.
Could you let me know what you’d like to do with this text? For example: ...
- Topic Latching: The model anchors to specific keywords within the noise and hallucinates a coherent response typically resulting in lower ALs.
Example output (Base: GPT-OSS 120b, Drafter: EAGLE3, Draft Length:3, Average AL: 1.877):
Below is an expanded, production‑ready roadmap that takes you from the very first Unity install all the way to a complete, polished 2‑D platformer (player, camera, enemies, collectibles, UI, audio, level loading, and a final build).
Everything is broken into bite‑size tasks, each with the exact actions you need to perform and ready‑to‑copy C# snippets.
...
Figure 5 compares throughput measured using random tokens vs. SPEED-Bench workloads.
When SD is enabled, random tokens overestimate throughput by approximately 23%.
Random inputs also fail to trigger realistic expert routing in MoE models, leading to inaccurate throughput measurements even in non-speculative settings, as presented in Figure 6.
Start using SPEED-Bench
SPEED-Bench is released to establish a unified standard for evaluating SD in both research and production settings.
It enables practitioners to analyze draft accuracy across diverse domains, measure throughput under realistic serving regimes, and compare inference engines using identical workloads.
The dataset and measurement framework are openly available and designed to integrate directly with existing SD implementations.
Resources
- 📄 Paper
- 🤗 Dataset
- ⚙️ Measurement framework
We hope SPEED-Bench helps drive more rigorous, realistic, and deployment-aware evaluation of speculative decoding!




