Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch [P]

Reddit r/MachineLearning / 4/27/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author released an educational GitHub repository that implements multiple speculative decoding methods from scratch behind a shared decoding/evaluation contract to enable easier comparison of proposer designs.
  • Implemented approaches include EAGLE-3, Medusa-1, standard draft-model speculation, PARD/parallel draft models, as well as training-free n-gram prompt lookup and suffix decoding.
  • The repo provides both training and inference paths when applicable, using Qwen/Qwen2.5-7B-Instruct as the target model with smaller learned heads or draft models, depending on the method.
  • The project explicitly highlights key trade-offs: proposer quality vs. verifier cost, why a higher acceptance rate does not necessarily mean higher throughput, and how methods like PARD can outperform autoregressive draft models despite lower acceptance rates.
  • Benchmark summaries and implementation notes are included, with results intended as behavioral/implementation benchmarks on limited eval slices due to compute constraints rather than broad performance claims.

I’ve been working on an educational implementation repo for speculative decoding:

https://github.com/shreyansh26/Speculative-Decoding

The goal is not to wrap existing libraries, but to implement several speculative decoding methods from scratch behind a shared decoding/evaluation contract so that the differences between proposer designs are easier to study.

Implemented methods so far:

  • EAGLE-3
  • Medusa-1
  • standard draft model speculation
  • PARD / parallel draft models
  • n-gram prompt lookup
  • suffix decoding
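Despite their differences, all of these methods fit a common propose/verify pattern, which is what makes a shared contract possible. As a rough sketch (the names `Proposer`, `propose`, and `speculative_step` are illustrative, not the repo's actual API), with greedy verification:

```python
# Hypothetical sketch of a shared decoding contract: every method supplies
# a `propose` function, and one verify-and-accept loop is reused across them.
from typing import Protocol


class Proposer(Protocol):
    def propose(self, context: list[int], k: int) -> list[int]:
        """Return up to k draft tokens given the current token context."""
        ...


def speculative_step(context, proposer, target_greedy, k=4):
    """One speculative step under greedy verification: accept the longest
    draft prefix the target model agrees with, then append one token from
    the target itself (its correction, or a bonus token)."""
    draft = proposer.propose(context, k)
    accepted = []
    for tok in draft:
        if target_greedy(context + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first disagreement invalidates the rest of the draft
    # The target model always contributes one more token per step.
    accepted.append(target_greedy(context + accepted))
    return accepted
```

In a real implementation the `target_greedy` calls for all draft positions are batched into a single forward pass over the target model; that single verification pass is where the speedup comes from.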

The repo has both training and inference paths where applicable. For learned proposers, I use Qwen/Qwen2.5-7B-Instruct as the target model and small learned/speculative heads or draft models, depending on the method. For training-free methods, the proposer is built from the prompt/generated context.
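For intuition on the training-free case, n-gram prompt lookup can be sketched in a few lines (a simplified illustration, not the repo's code; `ngram_lookup_propose` and its parameters are made up for this example):

```python
# Minimal sketch of n-gram prompt lookup: match the last n tokens of the
# context against an earlier occurrence and copy what followed it as drafts.
def ngram_lookup_propose(context, n=3, k=4):
    """Propose up to k draft tokens by finding the most recent earlier
    occurrence of the final n-gram and copying its continuation."""
    if len(context) <= n:
        return []
    pattern = context[-n:]
    # Search backwards, skipping the pattern's own occurrence at the end.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == pattern:
            continuation = context[start + n:start + n + k]
            if continuation:
                return continuation
    return []
```

This is why such methods shine when the prompt contains reusable structure (code edits, retrieval contexts, repeated boilerplate) and propose nothing useful otherwise.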

A few things I wanted the repo to make explicit:

  1. The distinction between proposer quality and verifier cost.
  2. Why a high acceptance rate does not always imply higher throughput.
  3. Why methods like PARD can be faster despite lower acceptance than an autoregressive draft model.
  4. How EAGLE/Medusa-style learned heads differ from draft-model speculation.
  5. How simple methods like n-gram and suffix decoding behave when the prompt contains a reusable structure.
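Points 2 and 3 come down to arithmetic. A back-of-the-envelope model (my simplification, with illustrative numbers and an i.i.d. acceptance assumption) of tokens produced per unit of target-model time:

```python
# Simplified throughput model: expected tokens per speculative step divided
# by the cost of producing and verifying the drafts (costs are relative to
# one target-model forward pass; numbers below are purely illustrative).
def tokens_per_unit_time(k, accept_rate, draft_cost, verify_cost=1.0):
    """k: draft tokens proposed per step
    accept_rate: per-token acceptance probability (assumed independent)
    draft_cost: cost of producing all k drafts
    verify_cost: one batched target verification pass"""
    # Expected accepted prefix length, plus the target's own token per step.
    expected_tokens = sum(accept_rate ** i for i in range(1, k + 1)) + 1
    return expected_tokens / (draft_cost + verify_cost)


# An autoregressive draft model pays k sequential forward passes; a parallel
# draft (PARD-style) pays roughly one. Lower acceptance can still win:
ar_draft = tokens_per_unit_time(k=4, accept_rate=0.80, draft_cost=4 * 0.15)
par_draft = tokens_per_unit_time(k=4, accept_rate=0.65, draft_cost=0.15)
```

With these (made-up) costs, the parallel draft comes out ahead despite accepting fewer tokens per step, which is the shape of the PARD result the repo tries to make visible.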

The repo includes benchmark summaries, command lines, checkpoints/exports, and implementation notes. Some results are intentionally on small train-overlap eval slices due to compute constraints, so I would treat the numbers as implementation/behavioral benchmarks rather than broad generalization claims.

I built this mostly as a learning resource for people who want to understand speculative decoding at the algorithm + systems boundary: how the proposer is trained, how draft tokens are generated, how target verification works, what gets cached, and where the speedups actually come from.
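On the verification side, the standard lossless rule for sampled (non-greedy) decoding is worth seeing concretely. A sketch of the accept/resample rule, with distributions as plain `dict`s mapping token to probability (illustrative only, not the repo's implementation):

```python
import random


# Standard speculative sampling verification: accept draft token x with
# probability min(1, p_target(x) / p_draft(x)); on rejection, resample from
# the normalized residual max(p_target - p_draft, 0). This keeps the output
# distribution exactly equal to sampling from the target model alone.
def verify_token(token, p_target, p_draft, rng=random):
    """Return (accepted, token_to_emit)."""
    pt = p_target.get(token, 0.0)
    pd = p_draft.get(token, 0.0)
    if pd > 0 and rng.random() < min(1.0, pt / pd):
        return True, token
    # Residual distribution: probability mass where the target exceeds
    # the draft; random.choices normalizes the weights for us.
    residual = {t: max(p_target.get(t, 0.0) - p_draft.get(t, 0.0), 0.0)
                for t in p_target}
    tokens, weights = zip(*residual.items())
    return False, rng.choices(tokens, weights=weights, k=1)[0]
```

The systems-level point is that this check runs for every draft position using logits from one batched target forward pass, so rejected tokens cost almost nothing beyond the wasted draft work.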

submitted by /u/shreyansh26