AI Navigate

Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

arXiv cs.CL / 3/13/2026


Key Points

  • Speculative decoding uses multiple language models to accelerate inference and improve throughput.
  • The paper notes that prior throughput optimization relied on costly experimental approaches tied to LLM training.
  • It proposes a theory that analytically links key pre-trained LLM hyperparameters to the throughput of a downstream speculative decoding inference system.
  • The theory enables predicting throughput-optimal hyperparameters before pre-training, guiding model and system design.
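To make the draft-and-verify idea above concrete, here is a minimal toy sketch of a speculative decoding loop. It is purely illustrative and not the paper's method: `draft_model` and `target_model` are hypothetical stand-in functions over integer tokens, chosen so the draft agrees with the target most of the time.

```python
# Toy sketch of speculative decoding (illustrative; not the paper's system).
# A cheap "draft" model proposes a block of tokens; the expensive "target"
# model verifies the whole block at once, keeping the longest agreeing prefix.

def draft_model(context):
    # Hypothetical cheap model: next token = (last token + 1) % 10.
    return (context[-1] + 1) % 10

def target_model(context):
    # Hypothetical expensive model: agrees with the draft except every
    # 4th position, where it emits a different token.
    nxt = (context[-1] + 1) % 10
    return nxt if len(context) % 4 != 0 else (nxt + 5) % 10

def speculative_decode(prompt, num_tokens, block_size=4):
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < num_tokens:
        # 1) Draft model proposes block_size tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(block_size):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model verifies the whole block in one call.
        target_calls += 1
        accepted, ctx = [], list(out)
        for t in proposal:
            expected = target_model(ctx)
            if t == expected:
                accepted.append(t)
                ctx.append(t)
            else:
                # First mismatch: keep the target's token and stop.
                accepted.append(expected)
                break
        out.extend(accepted)
    return out[len(prompt):len(prompt) + num_tokens], target_calls

tokens, calls = speculative_decode([0], num_tokens=8)
print(tokens, calls)  # 8 tokens generated with only 2 target-model calls
```

Plain autoregressive decoding would need one target-model call per token (8 here); the speculative loop amortizes verification over blocks, which is the throughput lever the paper's theory aims to optimize analytically.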

Abstract

Speculative decoding is a technique that uses multiple language models to accelerate inference. Previous works have used an experimental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of speculative decoding proposes a theory that analytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput-optimal hyperparameters for the components of an inference system before their pre-training.