Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch

Reddit r/LocalLLaMA / 4/27/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Models & Research

Key Points

  • The author created an educational GitHub repository that implements multiple speculative decoding methods from scratch under a shared decoding/evaluation interface to make proposer-design differences easier to study.
  • Implemented methods include EAGLE-3, Medusa-1, standard draft-model speculation, PARD (parallel draft models), plus n-gram prompt lookup and suffix decoding, with training and inference paths where applicable.
  • The repo uses Qwen/Qwen2.5-7B-Instruct as the target model for learned proposer variants and constructs training-free proposers directly from prompt/generated context.
  • The write-up emphasizes key distinctions: proposer quality vs. verifier cost, why a high acceptance rate may not translate into higher throughput, and how approaches like PARD can outperform autoregressive draft models even with lower acceptance rates.
  • It also provides benchmark summaries, command lines, checkpoints/exports, and implementation notes, with results framed as implementation/behavioral benchmarks rather than broad generalization due to compute constraints.

I’ve been working on an educational implementation repo for speculative decoding:

https://github.com/shreyansh26/Speculative-Decoding

The goal is not to wrap existing libraries, but to implement several speculative decoding methods from scratch behind a shared decoding/evaluation contract so that the differences between proposer designs are easier to study.

Implemented methods so far:

  • EAGLE-3
  • Medusa-1
  • standard draft model speculation
  • PARD / parallel draft models
  • n-gram prompt lookup
  • suffix decoding

The repo has both training and inference paths where applicable. For learned proposers, I use Qwen/Qwen2.5-7B-Instruct as the target model and small learned/speculative heads or draft models, depending on the method. For training-free methods, the proposer is built from the prompt/generated context.
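For intuition, here is roughly what a training-free proposer looks like. This is a minimal sketch in plain Python (the function name and defaults are mine, not the repo's actual code): n-gram prompt lookup matches the last few tokens of the context against earlier occurrences and proposes whatever followed the match.

```python
from typing import List

def ngram_propose(context: List[int], ngram_size: int = 3, num_draft: int = 8) -> List[int]:
    """Propose draft tokens by matching the last `ngram_size` tokens against
    earlier occurrences in the context (prompt + tokens generated so far).
    Training-free: the 'proposer' is just the context itself."""
    if len(context) <= ngram_size:
        return []
    tail = context[-ngram_size:]
    # Scan backwards from the most recent candidate position.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == tail:
            continuation = context[start + ngram_size:start + ngram_size + num_draft]
            if continuation:
                return continuation
    return []  # no match -> fall back to plain autoregressive decoding

# Repeated structure in the prompt makes drafts nearly free:
ctx = [5, 7, 9, 11, 13, 5, 7, 9]
print(ngram_propose(ctx, num_draft=2))  # -> [11, 13]
```

Suffix decoding pushes the same idea further by indexing the context (and past requests) in a suffix tree, so longer and more frequent matches can be found cheaply.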

A few things I wanted the repo to make explicit:

  1. The distinction between proposer quality and verifier cost.
  2. Why a high acceptance rate does not always imply higher throughput (see the cost-model sketch after this list).
  3. Why methods like PARD can be faster despite lower acceptance than an autoregressive draft model.
  4. How EAGLE/Medusa-style learned heads differ from draft-model speculation (see the head sketch below).
  5. How simple methods like n-gram and suffix decoding behave when the prompt contains a reusable structure.
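On points 2 and 3, a back-of-the-envelope cost model makes the tradeoff concrete. Every number below is made up for illustration (none of them come from the repo's benchmarks): each speculation round pays the drafting cost plus one target verification pass, and yields the accepted draft tokens plus one bonus token from the verifier.

```python
def tokens_per_second(t_draft_round: float, t_verify: float, expected_accepted: float) -> float:
    """Per speculation round: pay drafting + one target verification pass,
    gain the accepted draft tokens + 1 bonus token from the verifier."""
    return (expected_accepted + 1.0) / (t_draft_round + t_verify)

K = 8              # draft tokens proposed per round
T_VERIFY = 30e-3   # one target pass over K+1 positions (illustrative)

# Autoregressive draft model: K sequential small passes, higher acceptance.
ar_draft = tokens_per_second(K * 4e-3, T_VERIFY, expected_accepted=5.0)
# PARD-style parallel draft: ~one parallel pass, lower acceptance.
pard = tokens_per_second(6e-3, T_VERIFY, expected_accepted=4.0)

print(f"AR draft: {ar_draft:.0f} tok/s")  # ~97
print(f"PARD:     {pard:.0f} tok/s")      # ~139
```

The parallel draft accepts fewer tokens per round but pays far less to produce them, which is exactly why acceptance rate alone is a misleading headline number.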
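On point 4, the structural difference also fits in a few lines. A Medusa-1-style head is just a small residual block plus an LM head reading the target model's final hidden state (this simplifies the paper's formulation, and the layer sizes below are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MedusaHead(nn.Module):
    """One Medusa-1-style head: a residual block over the target's last
    hidden state, then its own LM head predicting the token k steps ahead."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.lm_head(hidden + F.silu(self.proj(hidden)))

# All heads read the SAME hidden state from a single target forward pass,
# so drafting adds no sequential passes -- unlike a draft model, which must
# run autoregressively once per draft token.
heads = nn.ModuleList([MedusaHead(4096, 152064) for _ in range(4)])  # sizes illustrative
```

EAGLE-style heads go a step further by feeding features (hidden states plus token embeddings) through a small autoregressive layer, which buys higher acceptance at slightly higher draft cost.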

The repo includes benchmark summaries, command lines, checkpoints/exports, and implementation notes. Due to compute constraints, some results are intentionally reported on small eval slices that overlap with the training data, so I would treat the numbers as implementation/behavioral benchmarks rather than broad generalization claims.

I built this mostly as a learning resource for people who want to understand speculative decoding at the algorithm + systems boundary: how the proposer is trained, how draft tokens are generated, how target verification works, what gets cached, and where the speedups actually come from.
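If you just want the one-screen version of what "target verification" means here, a greedy-decoding sketch is below. It assumes an HF-style model that returns `.logits`, ignores KV caching, and accepts via exact greedy match; the real implementations also have to handle sampling-based acceptance and cache management.

```python
import torch

@torch.no_grad()
def verify_greedy(target_model, input_ids: torch.Tensor, draft: torch.Tensor) -> torch.Tensor:
    """Verify `draft` tokens with ONE target forward pass (greedy decoding).
    input_ids: (1, T) context; draft: (1, k) proposed tokens.
    Returns the accepted prefix of the draft plus one 'bonus' token, so even
    a fully rejected draft still yields one new token per round."""
    ids = torch.cat([input_ids, draft], dim=-1)              # (1, T + k)
    logits = target_model(ids).logits                        # (1, T + k, V)
    # Target's greedy prediction at each draft position, plus one past it.
    preds = logits[0, input_ids.shape[-1] - 1:].argmax(-1)   # (k + 1,)
    # Accept the longest prefix where the draft matches the target.
    n_accept = 0
    for i in range(draft.shape[-1]):
        if draft[0, i] != preds[i]:
            break
        n_accept += 1
    bonus = preds[n_accept:n_accept + 1]                     # target's own next token
    return torch.cat([draft[0, :n_accept], bonus])
```

The "where the speedups come from" part drops out of this loop: one big verification pass amortizes the target's cost over several tokens, so everything hinges on how cheaply the proposer can keep the accepted prefix long.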

submitted by /u/shreyansh26