Why do people care more about decoding tokens/s?

Reddit r/LocalLLaMA / 5/7/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis

Key Points

  • The author’s experience with local LLMs suggests that latency bottlenecks often come from prompt processing rather than the actual token decoding speed.
  • They note that if prompt processing is fast enough, generation typically exceeds ~10 tokens/second (agentic coding workflows usually start with a prompt of roughly 15k tokens), which they argue is already faster than a reader can follow by eye.
  • They report that running Qwen3.6 27B on a Mac mini took over 10 minutes to process a 64K-token prompt, leading them to switch to the 35B A3B variant instead.
  • They ask what they might be missing—whether methods like MTP improve prompt processing speed—or whether the bottleneck behavior changes substantially with discrete GPU configurations.
  • Overall, the post frames a troubleshooting question about where time is spent in local LLM inference and what factors dominate end-to-end responsiveness.

What I've noticed while using local LLMs recently is that in most cases the bottleneck is not decoding but prompt processing.

If the prompt processing speed is usable, then in most setups generation exceeds 10 tokens per second (an agentic coding session typically starts with a prompt of about 15k tokens), and doesn't that already exceed the speed we can follow with our eyes?
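For a rough sense of scale, here is a back-of-the-envelope check of that claim; the ~0.75 words-per-token ratio and the ~250 words-per-minute reading pace are assumed rule-of-thumb figures, not measurements from the post:

```python
# Rough check: is 10 tokens/s faster than a human can read along?
decode_tok_per_s = 10          # decoding speed cited in the post
words_per_token = 0.75         # assumed rule of thumb for English text
typical_reading_wpm = 250      # assumed average silent-reading pace

generated_wpm = decode_tok_per_s * words_per_token * 60
print(f"~{generated_wpm:.0f} words/min generated vs ~{typical_reading_wpm} words/min read")
# -> ~450 words/min generated, roughly 1.5-2x a typical reading pace
```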

I tried qwen3.6 27b, but it took more than 10 minutes to process a 64k prompt on my Mac mini, so I went with the 35b a3b instead.
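Splitting end-to-end latency into prefill and decode makes the complaint concrete. The sketch below only reuses the post's own numbers (64k-token prompt, >10 minutes of prompt processing, ~10 tokens/s decode); the implied prefill throughput and the 1,000-token output length are inferred assumptions, not benchmarks:

```python
# Split end-to-end latency into prefill (prompt processing) and decode.
prompt_tokens = 64_000        # prompt size from the post
prefill_time_s = 10 * 60      # ">10 minutes" of prompt processing reported
output_tokens = 1_000         # assumed length of a typical agentic reply
decode_tok_per_s = 10         # decoding speed cited in the post

prefill_tok_per_s = prompt_tokens / prefill_time_s    # ~107 tok/s implied
decode_time_s = output_tokens / decode_tok_per_s      # 100 s

print(f"implied prefill throughput: ~{prefill_tok_per_s:.0f} tok/s")
print(f"prefill: {prefill_time_s/60:.1f} min, decode: {decode_time_s/60:.1f} min")
# Prefill dominates: ~10 min vs ~1.7 min, so faster decoding barely improves
# time-to-useful-answer once the prompt is this long.
```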

What am I missing? Is prompt processing speed improved by MTP or other methods?

Or is the bottleneck just very different with discrete GPU setups?

submitted by /u/Interesting-Print366