Acceptable prompt processing speed for you?

Reddit r/LocalLLaMA / 4/19/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A user optimizing older hardware (Qwen3 on 4×V100 GPUs) reports that the lack of Flash Attention causes substantial slowdowns at longer context lengths.
  • The post asks the community what prompt processing speeds and context sizes are considered acceptable or “good” for agentic coding workflows (a rough time-to-first-token sketch follows this list).
  • The discussion is framed around practical performance trade-offs between throughput/latency and usable context length in local LLM deployments.
  • It highlights that long-context performance can be heavily affected by specific attention implementations and hardware constraints, not just model choice.
  • The request is primarily experiential and opinion-driven, aiming to set expectations for real-world usability rather than to introduce a new technical release.
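
To put numbers on “acceptable”: time-to-first-token for an uncached prompt is roughly prompt length divided by prompt processing speed. A minimal Python sketch of that arithmetic, using hypothetical speeds rather than figures from the thread:

```python
# Rough time-to-first-token (TTFT): prompt tokens / prompt processing
# speed. All speeds below are hypothetical, not measurements from the post.
def ttft_seconds(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    return prompt_tokens / pp_tokens_per_s

for ctx in (4_096, 32_768, 131_072):      # prompt sizes in tokens
    for pp in (100.0, 500.0, 2_000.0):    # prompt processing speed, tok/s
        print(f"{ctx:>7} tok @ {pp:>6.0f} tok/s -> {ttft_seconds(ctx, pp):8.1f} s")
```

For agentic coding, where each tool call can resubmit tens of thousands of tokens, even 500 tok/s of prompt processing means roughly a minute of waiting per 32k-token turn unless the KV cache from earlier turns is reused.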
Acceptable prompt processing speed for you?

I am currently optimising some ancient hardware (4×V100s) to run Qwen3, but without Flash Attention, prompt processing really slows down at longer contexts.

For agentic coding work, what processing speeds and context lengths do you consider acceptable or good?

submitted by /u/Simple_Library_2700
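
Why the slowdown scales the way it does: attention for each new token has to look at all previous tokens, so total prefill work grows roughly quadratically with context length. A toy model of that effect, with made-up constants rather than V100 measurements:

```python
# Toy model of long-context prefill slowdown: per-token attention work
# scales with the number of previous tokens, so total prefill time grows
# quadratically. Both constants are hypothetical, for illustration only.
BASE_S_PER_TOKEN = 1e-3  # assumed 1000 tok/s for the non-attention work
ATTN_S_PER_PAIR = 1e-7   # assumed cost per (token, previous-token) pair

def prefill_seconds(n: int) -> float:
    return BASE_S_PER_TOKEN * n + ATTN_S_PER_PAIR * n * (n - 1) / 2

for n in (4_096, 16_384, 65_536):
    t = prefill_seconds(n)
    print(f"{n:>6}-token prompt: {t:6.1f} s total, {n / t:5.0f} tok/s effective")
```

Optimized attention kernels shrink the quadratic term’s constant but do not remove it; upstream FlashAttention-2 targets Ampere and newer GPUs, so Volta cards like the V100 fall back to slower attention paths and hit that quadratic wall much earlier.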