Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch [P]

Reddit r/MachineLearning / 4/27/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author released an educational GitHub repository that implements multiple speculative decoding methods from scratch behind a shared decoding/evaluation contract to enable easier comparison of proposer designs.
  • Implemented approaches include EAGLE-3, Medusa-1, standard draft-model speculation, PARD/parallel draft models, as well as training-free n-gram prompt lookup and suffix decoding.
  • The repo provides both training and inference paths when applicable, using Qwen/Qwen2.5-7B-Instruct as the target model with smaller learned heads or draft models, depending on the method.
  • The project explicitly highlights key trade-offs: proposer quality vs. verifier cost, why a higher acceptance rate does not necessarily mean higher throughput, and how methods like PARD can outperform autoregressive draft models despite lower acceptance rates.
  • Benchmark summaries and implementation notes are included, with results intended as behavioral/implementation benchmarks on limited eval slices due to compute constraints rather than broad performance claims.

I’ve been working on an educational implementation repo for speculative decoding:

https://github.com/shreyansh26/Speculative-Decoding

The goal is not to wrap existing libraries, but to implement several speculative decoding methods from scratch behind a shared decoding/evaluation contract so that the differences between proposer designs are easier to study.

Implemented methods so far:

  • EAGLE-3
  • Medusa-1
  • standard draft model speculation
  • PARD / parallel draft models
  • n-gram prompt lookup
  • suffix decoding
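Despite their differences, all of these methods fit a common propose/verify pattern, which is what makes a shared contract possible. As a rough sketch (the names `Proposer`, `propose`, and `speculative_step` are illustrative, not the repo's actual API), with greedy verification:

```python
# Hypothetical sketch of a shared decoding contract: every method supplies
# a `propose` function, and one verify-and-accept loop is reused across them.
from typing import Protocol


class Proposer(Protocol):
    def propose(self, context: list[int], k: int) -> list[int]:
        """Return up to k draft tokens given the current token context."""
        ...


def speculative_step(context, proposer, target_greedy, k=4):
    """One speculative step under greedy verification: accept the longest
    draft prefix the target model agrees with, then append one token from
    the target itself (its correction, or a bonus token)."""
    draft = proposer.propose(context, k)
    accepted = []
    for tok in draft:
        if target_greedy(context + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first disagreement invalidates the rest of the draft
    # The target model always contributes one more token per step.
    accepted.append(target_greedy(context + accepted))
    return accepted
```

In a real implementation the `target_greedy` calls for all draft positions are batched into a single forward pass over the target model; that single verification pass is where the speedup comes from.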

The repo has both training and inference paths where applicable. For learned proposers, I use Qwen/Qwen2.5-7B-Instruct as the target model and small learned/speculative heads or draft models, depending on the method. For training-free methods, the proposer is built from the prompt/generated context.
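For intuition on the training-free case, n-gram prompt lookup can be sketched in a few lines (a simplified illustration, not the repo's code; `ngram_lookup_propose` and its parameters are made up for this example):

```python
# Minimal sketch of n-gram prompt lookup: match the last n tokens of the
# context against an earlier occurrence and copy what followed it as drafts.
def ngram_lookup_propose(context, n=3, k=4):
    """Propose up to k draft tokens by finding the most recent earlier
    occurrence of the final n-gram and copying its continuation."""
    if len(context) <= n:
        return []
    pattern = context[-n:]
    # Search backwards, skipping the pattern's own occurrence at the end.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == pattern:
            continuation = context[start + n:start + n + k]
            if continuation:
                return continuation
    return []
```

This is why such methods shine when the prompt contains reusable structure (code edits, retrieval contexts, repeated boilerplate) and propose nothing useful otherwise.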

A few things I wanted the repo to make explicit:

  1. The distinction between proposer quality and verifier cost.
  2. Why a high acceptance rate does not always imply higher throughput.
  3. Why methods like PARD can be faster despite lower acceptance than an autoregressive draft model.
  4. How EAGLE/Medusa-style learned heads differ from draft-model speculation.
  5. How simple methods like n-gram and suffix decoding behave when the prompt contains a reusable structure.
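Points 2 and 3 come down to arithmetic. A back-of-the-envelope model (my simplification, with illustrative numbers and an i.i.d. acceptance assumption) of tokens produced per unit of target-model time:

```python
# Simplified throughput model: expected tokens per speculative step divided
# by the cost of producing and verifying the drafts (costs are relative to
# one target-model forward pass; numbers below are purely illustrative).
def tokens_per_unit_time(k, accept_rate, draft_cost, verify_cost=1.0):
    """k: draft tokens proposed per step
    accept_rate: per-token acceptance probability (assumed independent)
    draft_cost: cost of producing all k drafts
    verify_cost: one batched target verification pass"""
    # Expected accepted prefix length, plus the target's own token per step.
    expected_tokens = sum(accept_rate ** i for i in range(1, k + 1)) + 1
    return expected_tokens / (draft_cost + verify_cost)


# An autoregressive draft model pays k sequential forward passes; a parallel
# draft (PARD-style) pays roughly one. Lower acceptance can still win:
ar_draft = tokens_per_unit_time(k=4, accept_rate=0.80, draft_cost=4 * 0.15)
par_draft = tokens_per_unit_time(k=4, accept_rate=0.65, draft_cost=0.15)
```

With these (made-up) costs, the parallel draft comes out ahead despite accepting fewer tokens per step, which is the shape of the PARD result the repo tries to make visible.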

The repo includes benchmark summaries, command lines, checkpoints/exports, and implementation notes. Some results are intentionally on small train-overlap eval slices due to compute constraints, so I would treat the numbers as implementation/behavioral benchmarks rather than broad generalization claims.

I built this mostly as a learning resource for people who want to understand speculative decoding at the algorithm + systems boundary: how the proposer is trained, how draft tokens are generated, how target verification works, what gets cached, and where the speedups actually come from.
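On the verification side, the standard lossless rule for sampled (non-greedy) decoding is worth seeing concretely. A sketch of the accept/resample rule, with distributions as plain `dict`s mapping token to probability (illustrative only, not the repo's implementation):

```python
import random


# Standard speculative sampling verification: accept draft token x with
# probability min(1, p_target(x) / p_draft(x)); on rejection, resample from
# the normalized residual max(p_target - p_draft, 0). This keeps the output
# distribution exactly equal to sampling from the target model alone.
def verify_token(token, p_target, p_draft, rng=random):
    """Return (accepted, token_to_emit)."""
    pt = p_target.get(token, 0.0)
    pd = p_draft.get(token, 0.0)
    if pd > 0 and rng.random() < min(1.0, pt / pd):
        return True, token
    # Residual distribution: probability mass where the target exceeds
    # the draft; random.choices normalizes the weights for us.
    residual = {t: max(p_target.get(t, 0.0) - p_draft.get(t, 0.0), 0.0)
                for t in p_target}
    tokens, weights = zip(*residual.items())
    return False, rng.choices(tokens, weights=weights, k=1)[0]
```

The systems-level point is that this check runs for every draft position using logits from one batched target forward pass, so rejected tokens cost almost nothing beyond the wasted draft work.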

submitted by /u/shreyansh26