Questions for anyone who knows FPGAs:
- What is the largest model an FPGA design can realistically hold? I've read 20-30M parameters max. Is it possible to go beyond that with quantization, at a reasonable price?
- Taalas: is what they're doing with ASICs more viable? (Rumored: Qwen 27B at 10k tok/sec on apparently <$800 of hardware.)
Would speculative decoding work here? Or are there other strategies that would be better, if the smaller model generates tokens 100x faster?
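
To make the speculative decoding question concrete, here is a rough back-of-the-envelope sketch in Python. It uses the usual simplified analysis: the draft model proposes `gamma` tokens, the big model verifies them in one pass, and each drafted token is accepted with some probability `alpha`. The acceptance rates below are made-up assumptions, not measurements, and `expected_speedup` is just a hypothetical helper name. The only number taken from the question is the 100x draft speed (`cost_ratio = 0.01`).

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup of speculative decoding over plain decoding.

    alpha:      assumed chance that the target model accepts a drafted token
    gamma:      number of tokens drafted per verification pass
    cost_ratio: draft time per token / target time per token (0.01 = 100x faster)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    # Expected tokens produced per verification pass (accepted drafts plus
    # the one token the target model always contributes).
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one pass, in units of a single target-model forward step.
    step_cost = gamma * cost_ratio + 1
    return expected_tokens / step_cost


if __name__ == "__main__":
    cost_ratio = 0.01  # draft model 100x faster, as in the question
    for alpha in (0.5, 0.7, 0.9):  # assumed acceptance rates
        for gamma in (2, 4, 8, 16):
            s = expected_speedup(alpha, gamma, cost_ratio)
            print(f"alpha={alpha:.1f} gamma={gamma:2d} speedup~{s:.2f}x")
```

If this model is roughly right, a 100x faster draft makes drafting almost free, so the gain is capped mostly by the acceptance rate (about `1 / (1 - alpha)` for long drafts) rather than by draft speed.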
Thanks!




