Hey fellow Llamas, thank you for all the nice words and great feedback on the last post I made. We have something new we thought would be useful to share. As always your time is precious, so I'll keep it short.

We built speculative prefill for long-context decode on quantized 27B targets, C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt; the heavy target only prefills the spans that matter. Repo: github.com/Luce-Org/lucebox-hub (open source, MIT).

Head-to-head on Qwen3.6-27B Q4_K_M, RTX 3090, single-shot: 24.8 s TTFT vs ~257 s for vanilla llama.cpp = ~10.4× at 128K (and 13.5 s vs 134.95 s = 10.0× at 64K), with NIAH retrieval preserved end-to-end. No Python, no Triton, no PyTorch in the inference loop.

The problem

Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, vanilla llama.cpp takes 248.4 s cold (llama-bench pp131072 --no-warmup -r 1, 527.6 tok/s). That is 4.1 minutes staring at a blank screen before the first token. Decode is fast, but the wait kills the UX. Warmed steady-state is better (169.3 s at 128K) but still painful, and it grows quadratically as you push context.
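To put a rough number on that: if the target only has to prefill a keep_ratio fraction of the prompt, the attention term shrinks quadratically. A back-of-envelope sketch (illustrative arithmetic only, not code from the repo):

```cpp
// Back-of-envelope illustration of the O(S^2) prefill argument from the post.
// Not code from lucebox-hub: attention cost is modeled as S^2 and everything
// else (drafter pass, per-token linear work, quantization) is ignored.
#include <cstdio>

int main() {
    const double S = 131072.0;          // full prompt length (the 128K case)
    const double keep_ratio = 0.05;     // fraction of tokens the target prefills
    const double kept = S * keep_ratio; // ~6.5K survivor tokens

    const double full_cost = S * S;             // vanilla prefill, attention term
    const double compressed_cost = kept * kept; // prefill over survivors only

    std::printf("survivors: %.0f tokens\n", kept);
    std::printf("attention-term reduction: %.0fx\n", full_cost / compressed_cost);
    // Prints ~400x on the attention term alone. The end-to-end TTFT win is ~10x,
    // because the drafter still has to score the full 131K-token prompt.
    return 0;
}
```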
Standing on shoulders

This work stands on two recent papers, both excellent reads: Speculative Prefill and FlashPrefill.

Our contribution is the C++/CUDA composition of these two algorithms, in-process, on a 24 GB consumer card. As far as we are aware, the two papers had not been combined in an open implementation before.

What we built
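Here's a minimal sketch of the selection step (illustrative C++ with made-up names, not the actual lucebox-hub code); in the real pipeline the importance scores come from the drafter's pass over the full prompt:

```cpp
// Illustrative sketch of drafter-guided span selection, NOT lucebox-hub code:
// rank tokens by drafter importance, keep the top keep_ratio fraction, and
// merge survivors into contiguous spans so the target prefills whole chunks
// at their original positions.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Span { int begin; int end; }; // token range [begin, end)

std::vector<Span> select_spans(const std::vector<float>& importance, float keep_ratio) {
    const int n = static_cast<int>(importance.size());
    const int n_keep = std::min(n, std::max(1, static_cast<int>(n * keep_ratio)));

    // Indices of the n_keep highest-scoring tokens.
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::partial_sort(order.begin(), order.begin() + n_keep, order.end(),
                      [&](int a, int b) { return importance[a] > importance[b]; });

    std::vector<char> keep(n, 0);
    for (int i = 0; i < n_keep; ++i) keep[order[i]] = 1;

    // Merge adjacent kept tokens into contiguous spans.
    std::vector<Span> spans;
    for (int i = 0; i < n; ++i) {
        if (!keep[i]) continue;
        if (!spans.empty() && spans.back().end == i) spans.back().end = i + 1;
        else spans.push_back({i, i + 1});
    }
    return spans;
}

int main() {
    // Toy drafter scores: one high-importance "needle" region in flat filler.
    std::vector<float> scores(200, 0.01f);
    for (int i = 120; i < 130; ++i) scores[i] = 1.0f;

    for (const Span& s : select_spans(scores, 0.05f))
        std::printf("keep tokens [%d, %d)\n", s.begin, s.end); // -> [120, 130)
    return 0;
}
```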
Setup

Numbers

Single-shot on RTX 3090, Qwen3.6-27B Q4_K_M target, q4_0 KV, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05. NIAH single-needle as the end-to-end retrieval check. Baseline is vanilla llama.cpp with default f16 KV (apples-to-oranges on KV; q4_0 KV costs ~3% AL at short context, 8.56 to 8.33, benchmarked).
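For reference, the single-needle probe has this shape: a magic key plus a random 7-digit answer dropped at a random position in filler. The snippet below is an illustrative reconstruction, not the actual benchmark harness; the exact filler and key wording are placeholders:

```cpp
// Rough sketch of a single-needle NIAH probe: a "magic key" sentence with a
// random 7-digit answer placed at a random spot inside filler text. Wording
// is illustrative only; the benchmark's own prompt text may differ.
#include <cstdio>
#include <random>
#include <string>

std::string build_niah_prompt(int filler_sentences, std::mt19937& rng) {
    std::uniform_int_distribution<int> digit7(1000000, 9999999);
    std::uniform_int_distribution<int> pos(0, filler_sentences - 1);
    const int answer = digit7(rng);
    const int needle_at = pos(rng);

    std::string prompt;
    for (int i = 0; i < filler_sentences; ++i) {
        if (i == needle_at)
            prompt += "The magic key is " + std::to_string(answer) + ". ";
        prompt += "The grass is green and the sky is blue. ";
    }
    prompt += "\nWhat is the magic key? Answer with the number only.\n";
    std::fprintf(stderr, "expected answer: %d (sentence %d)\n", answer, needle_at);
    return prompt;
}

int main() {
    std::mt19937 rng(42);
    // Scale filler_sentences up (tens of thousands) to hit 32K-128K-token contexts.
    std::printf("%s", build_niah_prompt(10, rng).c_str());
    return 0;
}
```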
These are cold-cache numbers (first request after process boot). Warmed-vs-warmed is a smaller multiplier because llama.cpp settles into ~169 s at 128K once caches are hot. Both numbers are real, and the right one depends on your workload; if you keep an engine resident, use warmed. Decode after prefill is the standard DFlash spec-decode path with DDTree (~74 tok/s sustained on Qwen3.6-27B Q4_K_M).

Quality

NIAH single-needle (magic-key + 7-digit answer randomly placed in filler) retrieved at every context tested from 32K through 128K, keep_ratio=0.05, DFLASH_FP_ALPHA=0.85. Honest flag: NIAH single-needle is a structurally easy probe for an attention-based selection method like ours, since the algorithm is well-suited to finding a single high-attention span. RULER and NIAH multi-needle are next on the list; a fair audit should wait for those numbers.

Why the stack works

Speculative prefill solves a quality problem: how do you compress without losing the answer-relevant content? FlashPrefill solves a speed problem inside the drafter step: how do you make the drafter fast enough at 128K that it doesn't become the bottleneck? They compose cleanly because the target side (DFlash spec decode) is unchanged; it just receives a much shorter prompt with full attention enabled. At 128K, drafter scoring is now the dominant cost (~12 s of the 24.8 s TTFT). Target prefill on the compressed ~6.5K survivors is ~10 s; the remaining ~3 s is the park/unpark/free dance. The next obvious lever is a smaller or distilled drafter, which we have not done yet.

Tuning

keep_ratio=0.05 is the default. 0.02 cuts target prefill from ~10 s to ~3 s but starts losing the needle. DFLASH_FP_ALPHA=0.99 cuts ~1 s at 128K with a small NIAH-margin loss. Calibration territory.

Any feedback is more than welcome!
PFlash: 10x prefill speedup over llama.cpp at 128K on an RTX 3090
Reddit r/LocalLLaMA / 5/1/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- PFlash proposes an in-process “speculative prefill” approach for long-context decoding on quantized 27B targets, using C++/CUDA only, to skip computation for prompt spans that are unlikely to matter.
- On an RTX 3090 with Qwen3.6-27B Q4_K_M, it reports roughly 10× faster TTFT at 128K context (24.8s vs ~257s for vanilla llama.cpp), and about 10× faster at 64K (13.5s vs 134.95s).
- The method preserves end-to-end NIAH retrieval while avoiding a slow prefill bottleneck, targeting the typical O(S²) scaling cost that makes ultra-long prompts feel unresponsive.
- The work combines ideas from Speculative Prefill, Cross-Family Speculative Prefill, and FlashPrefill (plus block-sparse attention) into an open-source repository (MIT) without requiring Python, Triton, or PyTorch in the inference loop.