FPGAs for speculative decoding

Reddit r/LocalLLaMA / 5/3/2026


Key Points

  • The post discusses whether FPGAs could be used to accelerate speculative decoding, particularly when a smaller model can generate tokens far faster than a larger one.
  • It raises questions about the maximum feasible model size for FPGA-based designs and whether quantization would allow scaling beyond commonly cited 20–30M parameters at acceptable cost.
  • It weighs FPGA vs. ASIC tradeoffs by asking whether work like Taalas's (cited via rumored industry figures) makes specialized ASICs more viable than FPGA-like approaches.
  • The author seeks alternative strategies that might outperform speculative decoding if the draft model is around 100× faster.
  • Overall, it’s a request for technical feasibility guidance and performance/cost tradeoffs for FPGA deployment in local LLM decoding workflows.

Anyone who knows stuff about FPGAs:

- What max model size can one be designed for? (I've read 20-30M parameters max — is it possible to go bigger if quantized, at a reasonable price?)
- Taalas — is what they're doing with ASICs more viable? (rumored: Qwen 27B @ 10k tok/sec at apparently <$800 of hardware)

Would speculative decoding work here? Are there other strategies that would be better, if the smaller model generates tokens at ~100x the speed?
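For context, the draft-and-verify loop at the heart of speculative decoding can be sketched as below. This is a greedy toy, not a real implementation: `draft_next` and `target_next` are hypothetical stand-ins for the fast draft model (e.g. FPGA-hosted) and the large target model, and real systems verify by comparing/sampling from both models' probability distributions rather than exact token matches.

```python
def draft_next(seq):
    # Toy draft model: usually agrees with the target, but diverges after a 4.
    return (seq[-1] + 1) % 10 if seq[-1] != 4 else 0

def target_next(seq):
    # Toy target model: the "ground truth" next token.
    return (seq[-1] + 1) % 10

def speculative_decode(prompt, n_tokens, k=4):
    """Generate n_tokens greedily: the draft proposes k tokens per round,
    the target verifies all k positions in one (notionally batched) pass,
    and we keep the longest agreeing prefix plus one corrected token."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively (cheap and fast).
        proposal, ctx = [], seq[:]
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies; accept matches, substitute its own token on first miss.
        accepted, ctx = [], seq[:]
        for t in proposal:
            expect = target_next(ctx)
            if t == expect:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expect)
                break
        seq.extend(accepted)
    return seq[len(prompt):][:n_tokens]
```

The point of a 100x-faster draft is that each round costs roughly one target forward pass but can emit up to k+1 tokens, so the speedup is bounded by the draft's acceptance rate, not its raw speed — which is why past some point a faster draft stops helping and better verification strategies (e.g. drafting a tree of candidates) matter more.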

Thanks!

submitted by /u/dp3471