Acceptable prompt processing speed for you?
Reddit r/LocalLLaMA / 4/19/2026

> I am currently optimising some ancient hardware to run Qwen3 (4×V100s), but the lack of Flash Attention means that at longer contexts the processing starts to really slow down. For agentic coding work, what processing speeds and context lengths do you consider acceptable or good?
💬 Opinion · Signals & Early Trends · Tools & Practical Usage
Key Points
- A user optimizing older hardware (Qwen3 on 4×V100 GPUs) reports that the lack of Flash Attention causes substantial slowdowns at longer context lengths.
- The post asks the community what prompt processing speeds and context sizes are considered acceptable or “good” for agentic coding workflows.
- The discussion is framed around practical performance trade-offs between throughput/latency and usable context length in local LLM deployments.
- It highlights that long-context performance depends heavily on the specific attention implementation and hardware constraints, not just on the choice of model.
- The request is primarily experiential and opinion-driven, aiming to set expectations for real-world usability rather than to introduce a new technical release.
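To make the trade-off in the discussion concrete, the sketch below estimates time-to-first-token from a prompt length and a prefill rate. The rates and context sizes are illustrative assumptions, not benchmarks from the thread; real prefill speed also tends to degrade at longer contexts without Flash Attention, so these figures are optimistic lower bounds.

```python
# Rough time-to-first-token (TTFT) estimate: seconds spent processing the
# prompt before any output token is generated. Numbers are hypothetical.

def time_to_first_token(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Seconds of prompt processing at a (assumed constant) prefill rate."""
    return prompt_tokens / prefill_tok_per_s

# Assumed 500 tok/s prefill: an agentic coding loop that sends 32k-token
# prompts waits roughly a minute per call just on prompt processing.
for ctx in (4_000, 16_000, 32_000):
    print(f"{ctx:>6} tokens -> {time_to_first_token(ctx, 500.0):.1f} s")
```

Because agentic workflows make many sequential model calls, even modest per-call prefill delays compound quickly, which is why the thread treats prompt-processing speed as a first-class usability metric.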
Related Articles

Black Hat USA
AI Business

Black Hat Asia
AI Business
Are we confusing Agent Execution Runtimes with true Agent Runtime Environments? [D]
Reddit r/MachineLearning

How to Debug AI-Generated Code: A Systematic Approach
Dev.to

"Browser OS" implemented by Qwen 3.6 35B: The best result I ever got from a local model
Reddit r/LocalLLaMA