Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something?

Reddit r/LocalLLaMA / 5/7/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis

Key Points

  • The author argues that most benchmarks and discussions overly emphasize token generation speed (tokens/s), while prefill (prompt processing) is often the real source of user-perceived latency.
  • Based on their experiments across multiple models and GPUs, they report that generation becomes usable once decoding starts (e.g., ~15 t/s), but waiting for the model to ingest the prompt dominates total wall-clock time.
  • They cite an example with Qwen 27B Q6 where generation runs at ~15 t/s while prefill reaches only ~300 t/s; with their prompts, more time goes into prompt processing than into completing the reply.
  • They note that recent hype around MTP (multi-token prediction) seems consistent with the idea that improving generation speed alone may not significantly reduce end-to-end time for common use cases.
  • The author asks whether others’ usage patterns differ, adding that their work is largely agentic (the model must ingest parts of a codebase), which makes prefill/context ingestion a bigger bottleneck than in normal chat.

I read this sub every day and I keep seeing benchmarks and discussions focused almost entirely on tokens/s generation speed. Prompt processing speed barely gets mentioned.

From my own experience running a bunch of different models on different GPUs for all kinds of tasks, the prefill stage is usually the part that actually feels slow. Once generation starts, even “only” 15 t/s is perfectly usable for me. The wait while the model chews through the prompt is what eats most of the time.

Seeing all the hype around MTP lately kind of reinforces that feeling. If generation speed improvements don’t really move the needle on total wall-clock time for typical use cases, why is everyone laser-focused on it?

For example, with Qwen 27B Q6 I’m getting ~15 t/s generation with my current setup (which feels fine no matter what I’m doing) but only ~300 t/s on prefill. I spend way more time staring at prompt processing than waiting for the actual reply to finish. Even with prompt caching.
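
To put rough numbers on that (the prompt and reply lengths below are just illustrative, not measurements from my setup), here’s the back-of-envelope split at those rates:

```python
# Back-of-envelope wall-clock split at the rates above.
# Prompt/reply lengths are illustrative, not measured.
PREFILL_TPS = 300   # prompt processing, tokens/s
DECODE_TPS = 15     # generation, tokens/s

def wall_clock(prompt_tokens: int, reply_tokens: int) -> None:
    prefill_s = prompt_tokens / PREFILL_TPS
    decode_s = reply_tokens / DECODE_TPS
    total_s = prefill_s + decode_s
    print(f"{prompt_tokens:>6}-tok prompt, {reply_tokens:>4}-tok reply: "
          f"prefill {prefill_s:6.1f}s, decode {decode_s:6.1f}s, "
          f"prefill share {prefill_s / total_s:5.1%}")

wall_clock(2_000, 600)    # chat-sized context
wall_clock(20_000, 600)   # agentic context: a chunk of a codebase
```

At chat-sized contexts decode dominates, but once the prompt runs into the tens of thousands of tokens the prefill share flips well past half, which is exactly the pattern I’m describing.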

Am I misunderstanding something about how most people use these models? Curious what others are seeing.
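
If you want to check your own split, one rough way is to stream a completion and treat time-to-first-token as prefill plus overhead. Sketch below assumes a local OpenAI-compatible server (llama.cpp’s server, vLLM, etc.); the base_url and model name are placeholders for your setup:

```python
# Rough TTFT vs. decode-rate measurement against a local
# OpenAI-compatible server. base_url/api_key/model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

prompt = "..."  # paste a realistically long prompt here

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="local-model",  # placeholder name
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # ~prefill finished
        chunks += 1
end = time.perf_counter()

ttft = first_token_at - start
decode_s = end - first_token_at
print(f"TTFT (≈ prefill): {ttft:.1f}s")
print(f"Decode: {chunks} chunks in {decode_s:.1f}s "
      f"(~{chunks / decode_s:.1f} chunks/s)")
```

Chunks aren’t exactly tokens and TTFT includes queueing/sampling overhead, so treat this as a rough split rather than a precise benchmark.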

Edit: I forgot to mention that I mostly do agentic work, where the model has to ingest part of the codebase before it can actually do anything useful. For normal chat this obviously isn’t an issue: context stays small, and you just need enough t/s to keep up with your reading speed.

submitted by /u/wbulot