PFlash: 10x prefill speedup over llama.cpp at 128K on an RTX 3090

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • PFlash proposes an in-process “speculative prefill” approach for long-context decoding on quantized 27B targets, using C++/CUDA only, to skip computation for prompt spans that are unlikely to matter.
  • On an RTX 3090 with Qwen3.6-27B Q4_K_M, it reports roughly 10× faster TTFT at 128K context (24.8s vs ~257s for vanilla llama.cpp), and about 10× faster at 64K (13.5s vs 134.95s).
  • The method preserves end-to-end NIAH retrieval while avoiding a slow prefill bottleneck, targeting the typical O(S²) scaling cost that makes ultra-long prompts feel unresponsive.
  • The work combines ideas from Speculative Prefill, Cross-Family Speculative Prefill, and FlashPrefill (plus block-sparse attention) into an open-source repository (MIT) without requiring Python, Triton, or PyTorch in the inference loop.

Hey fellow Llamas, thank you for all the nice words and great feedback on the last post I made. We have something new we thought would be useful to share. As always your time is precious, so I'll keep it short.

We built speculative prefill for long-context decode on quantized 27B targets, C++/CUDA only. A small drafter loaded in-process scores token importance over the full prompt; the heavy target only prefills the spans that matter.
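To make that concrete, here is a minimal CPU-side sketch of the selection step, assuming the drafter has already produced a per-token importance score. The names (Span, select_spans, merge_gap) and the toy importance profile are mine, not the repo's; the real code works on ggml tensors on the GPU.

```cpp
// Minimal sketch of speculative-prefill span selection (hypothetical names;
// the actual implementation operates on drafter attention inside ggml/CUDA).
#include <algorithm>
#include <cstdio>
#include <vector>

struct Span { int begin, end; };  // [begin, end) token range the target will prefill

// Keep the top keep_ratio fraction of tokens by drafter importance, then merge
// neighbouring survivors into contiguous spans so the target sees a few long
// runs instead of scattered single tokens.
std::vector<Span> select_spans(const std::vector<float>& importance,
                               float keep_ratio, int merge_gap = 16) {
    const int n      = (int)importance.size();
    const int n_keep = std::max(1, (int)(keep_ratio * n));

    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) idx[i] = i;
    std::partial_sort(idx.begin(), idx.begin() + n_keep, idx.end(),
                      [&](int a, int b) { return importance[a] > importance[b]; });
    idx.resize(n_keep);
    std::sort(idx.begin(), idx.end());

    std::vector<Span> spans;
    for (int i : idx) {
        if (!spans.empty() && i - spans.back().end <= merge_gap)
            spans.back().end = i + 1;      // extend the current span
        else
            spans.push_back({i, i + 1});   // start a new span
    }
    return spans;
}

int main() {
    // Toy drafter output: 2000 prompt tokens, slowly decaying background
    // importance, and one high-attention "needle" region at tokens 900..950.
    const int n = 2000;
    std::vector<float> importance(n);
    for (int i = 0; i < n; ++i) importance[i] = 0.1f / float(i + 1);
    for (int i = 900; i < 950; ++i) importance[i] = 1.0f;

    for (const Span& s : select_spans(importance, /*keep_ratio=*/0.05f))
        std::printf("keep tokens [%d, %d)\n", s.begin, s.end);
    // -> keep tokens [0, 50) and keep tokens [900, 950)
}
```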

Repo: github.com/Luce-Org/lucebox-hub (open source, MIT).

Head-to-head on Qwen3.6-27B Q4_K_M, RTX 3090, single-shot: 24.8 s TTFT vs ~257 s for vanilla llama.cpp = ~10.4× at 128K (and 13.5 s vs 134.95 s = 10.0× at 64K), with NIAH retrieval preserved end-to-end. No Python, no Triton, no PyTorch in the inference loop.

The problem

Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, vanilla llama.cpp takes 248.4 s cold (llama-bench pp131072 --no-warmup -r 1, 527.6 tok/s). That is 4.1 minutes staring at a blank screen before the first token. Decode is fast, but the wait kills the UX. Warmed steady-state is better (169.3 s at 128K) but still painful, and grows quadratically as you push context.
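For orientation, that cold number falls straight out of the reported throughput (prompt length divided by prefill tok/s); a throwaway check, nothing project-specific:

```cpp
#include <cstdio>

int main() {
    // llama-bench pp131072 reports 527.6 tok/s cold on the 3090, so time to
    // first token is roughly prompt_tokens / throughput.
    const double prompt_tokens = 131072.0;
    const double prefill_tok_s = 527.6;
    const double ttft_s        = prompt_tokens / prefill_tok_s;
    std::printf("~%.1f s (~%.1f min) before the first token\n",
                ttft_s, ttft_s / 60.0);   // ~248.4 s, ~4.1 min
}
```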

Standing on shoulders

This work stands on two recent lines of research, both excellent reads, plus the open-source components that make them runnable:

  • Speculative Prefill (Liu et al., arXiv:2502.02789) and Cross-Family Speculative Prefill (SambaNova, ICLR 2026). Insight: a small draft model's attention pattern over a long prompt faithfully predicts which tokens matter for the answer. Run the draft, score per-token importance, keep the top spans, drop the rest.
  • FlashPrefill (Fan et al., 2026). Block-sparse attention so the drafter itself does not pay O(S²) at 128K (a rough CPU sketch of the block-selection idea follows this list).
  • mit-han-lab/Block-Sparse-Attention (BSA) for the FA-2-derived sm_80+ sparse forward.
  • ggml / llama.cpp for the runtime. We link libggml*.a and never libllama.
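As promised above, here is a rough CPU sketch of block-sparse selection under my reading of the FlashPrefill idea: keys are mean-pooled per block, a query scores the pooled keys, and only blocks clearing a threshold tied to DFLASH_FP_ALPHA would go through the full attention forward. The shapes, the toy data, and the exact role of alpha are illustrative guesses, not the repo's kernels.

```cpp
// Rough CPU-side sketch of FlashPrefill-style block selection. Everything
// here (block count, scoring rule, how alpha enters) is illustrative; the
// repo does this with CUDA kernels over real K/Q tensors.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int   n_blocks = 8;      // key blocks after mean-pooling
    const int   head_dim = 4;
    const float alpha    = 0.85f;  // higher = stricter = fewer kept blocks

    // Mean-pooled key per block and one query vector (made-up values;
    // blocks 3 and 6 are the "hot" ones the query should attend to).
    std::vector<std::vector<float>> k_mean(n_blocks, std::vector<float>(head_dim));
    std::vector<float> q(head_dim, 1.0f);
    for (int b = 0; b < n_blocks; ++b)
        for (int d = 0; d < head_dim; ++d)
            k_mean[b][d] = (b == 3 || b == 6) ? 1.0f : 0.05f;

    // Score each block: softmax over q · mean(K_block).
    std::vector<float> score(n_blocks);
    float max_s = -1e30f, sum = 0.0f;
    for (int b = 0; b < n_blocks; ++b) {
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d) dot += q[d] * k_mean[b][d];
        score[b] = dot;
        max_s = std::max(max_s, dot);
    }
    for (int b = 0; b < n_blocks; ++b) { score[b] = std::exp(score[b] - max_s); sum += score[b]; }
    for (int b = 0; b < n_blocks; ++b) score[b] /= sum;

    // Select: keep only blocks whose score clears an alpha-scaled threshold
    // (one possible reading of the knob -- the point is only that a higher
    // alpha keeps fewer K-blocks per Q-row, matching the tuning notes below).
    const float thresh = alpha * *std::max_element(score.begin(), score.end());
    for (int b = 0; b < n_blocks; ++b)
        if (score[b] >= thresh)
            std::printf("keep key block %d (score %.3f)\n", b, score[b]);
}
```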

Our contribution is the C++/CUDA composition of these two algorithms, in-process, on a 24 GB consumer card. As far as we are aware, the two papers had not been combined in an open implementation before.

What we built

  • In-process composition. Drafter forward (custom Qwen3-0.6B BF16 ggml graph), FlashPrefill scoring, sparse attention, target prefill, and DFlash spec decode all run in one C++/CUDA process sharing one ggml allocator. No subprocess, no IPC, no Python, Triton, or PyTorch in the inference loop.
  • CUDA port of FlashPrefill. The reference (qhfan/FlashPrefill) is Triton. We wrote 4 CUDA kernels from scratch (mean_K, score, select, sparse_fwd) and dispatched the sparse forward through mit-han-lab/Block-Sparse-Attention. BSA ships as a libtorch C++ extension; pulling 2 GB of libtorch into a 24 GB inference loop was a non-starter, so we wired it in via a 3-header ATen/c10 stub set under dflash/deps/bsa_stubs/.
  • 24 GB memory orchestration. Drafter (1.3 GB weights + KV + ~600 MB BSA scratch at 128K) and the DFlash daemon (15 GB target + 3 GB draft + 3 GB KV) do not coexist on a 3090. The daemon parks, unparks, and frees weights between stages over a stdin protocol; it costs ~3 s per request and is what makes the whole pipeline fit on a single consumer card (a toy sketch of such a control loop follows this list).
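The stdin orchestration is easy to picture with a toy control loop. The command names and replies below are hypothetical stand-ins, not the daemon's real protocol; the point is only that the controller can free VRAM between stages so drafter and target never coexist on the card.

```cpp
// Toy model of a line-oriented stdin control loop for a weights daemon.
// "park" / "unpark" / "free" and the "ok ..." replies are hypothetical;
// the real dflash daemon defines its own messages.
#include <iostream>
#include <string>

static void park_weights()   { /* copy target weights to host, free VRAM */ }
static void unpark_weights() { /* upload weights back to the GPU         */ }
static void free_weights()   { /* drop weights entirely                  */ }

int main() {
    std::string cmd;
    while (std::getline(std::cin, cmd)) {
        if      (cmd == "park")   { park_weights();   std::cout << "ok park\n";   }
        else if (cmd == "unpark") { unpark_weights(); std::cout << "ok unpark\n"; }
        else if (cmd == "free")   { free_weights();   std::cout << "ok free\n";   }
        else if (cmd == "quit")   { break; }
        else                      { std::cout << "err unknown command\n"; }
        std::cout.flush();   // in this toy, the caller blocks on the reply line
    }
}
```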

Setup

```bash
# clone with submodules (pulls llama.cpp/ggml + Block-Sparse-Attention)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

# build dflash + BSA kernel (sm_80+, ~10 min cold compile pulls cutlass)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DDFLASH27B_ENABLE_BSA=ON
cmake --build build --target test_dflash test_flashprefill_kernels -j

# fetch weights (target + drafter + spec-decode draft)
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download Qwen/Qwen3-0.6B model.safetensors tokenizer.json --local-dir models/drafter/
huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/

# bench
cd ../pflash && pip install -e .
python tests/niah_gen.py --n 1 --ctx 131072 --out /tmp/niah_128k.jsonl
python tests/bench_niah_cpp.py \
  --bin ../dflash/build/test_dflash \
  --target ../dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft ../dflash/models/draft/model.safetensors \
  --drafter-gguf ../dflash/models/drafter/qwen3-0.6b.gguf \
  --cases /tmp/niah_128k.jsonl --keep-ratio 0.05
```

Numbers

Single-shot on an RTX 3090, Qwen3.6-27B Q4_K_M target, q4_0 KV, DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 keep_ratio=0.05, with NIAH single-needle as the end-to-end retrieval check. The baseline is vanilla llama.cpp with its default f16 KV, so the KV comparison is apples-to-oranges; q4_0 KV costs ~3% AL at short context (8.56 to 8.33, benchmarked).

| Context | PFlash TTFT | llama.cpp cold | Speedup (cold) | llama.cpp warmed |
|---------|-------------|----------------|----------------|------------------|
| 64K     | 13.5 s      | 134.95 s       | 10.0×          | (smaller)        |
| 128K    | 24.8 s      | 248.4 s        | 10.0×          | 169.3 s          |

These are cold-cache numbers (first request after process boot). Warmed-vs-warmed is a smaller multiplier because llama.cpp settles into ~169 s at 128K once caches are hot. Both numbers are real and the right one depends on your workload; if you keep an engine resident, use warmed.

Decode after prefill is the standard DFlash spec-decode path with DDTree (~74 tok/s sustained on Qwen3.6-27B Q4_K_M).

Quality

NIAH single-needle (a magic key plus a 7-digit answer placed at a random position in filler) was retrieved at every context tested from 32K through 128K, with keep_ratio=0.05 and DFLASH_FP_ALPHA=0.85.
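For anyone who wants to see the shape of the probe without running tests/niah_gen.py, a single-needle case looks roughly like this; the filler and needle wording here are my own stand-ins, not necessarily what the script emits.

```cpp
// Sketch of a NIAH single-needle case: a 7-digit "magic number" sentence
// buried at a random line in repetitive filler, followed by the question.
#include <cstdio>
#include <random>
#include <string>

int main() {
    std::mt19937 rng(42);
    const int  n_filler_lines = 4000;   // scale up to reach 128K-token prompts
    const int  needle_line    = std::uniform_int_distribution<int>(0, n_filler_lines - 1)(rng);
    const long magic          = std::uniform_int_distribution<long>(1000000, 9999999)(rng);

    std::string prompt;
    for (int i = 0; i < n_filler_lines; ++i) {
        if (i == needle_line)
            prompt += "The magic number for the secret key is " + std::to_string(magic) + ".\n";
        else
            prompt += "The grass is green and the sky is blue today.\n";
    }
    prompt += "\nWhat is the magic number for the secret key?\n";

    std::printf("%s", prompt.c_str());
    std::fprintf(stderr, "expected answer: %ld (line %d)\n", magic, needle_line);
}
```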

Honest flag: NIAH single-needle is a structurally easy probe for an attention-based selection method like ours, since the algorithm is well-suited to finding a single high-attention span. RULER and NIAH multi-needle are next on the list; a fair audit should wait for those numbers.

Why the stack works

Speculative prefill solves a quality problem: how do you compress the prompt without losing the answer-relevant content? FlashPrefill solves a speed problem inside the drafter step: how do you make the drafter fast enough at 128K that it doesn't become the bottleneck? They compose cleanly because the target side (DFlash spec decode) is unchanged; it just receives a much shorter prompt with full attention enabled.

At 128K, drafter scoring is now the dominant cost (~12 s of the 24.8 s TTFT). Target prefill on the compressed ~6.5K survivors is ~10 s; the remaining ~3 s is the park/unpark/free dance. The next obvious lever is a smaller or distilled drafter, which we have not done yet.

Tuning

```bash
DFLASH_FP_USE_BSA=1   # dispatch sparse FA forward through BSA (sm_80+, required for 10x)
DFLASH_FP_ALPHA=0.85  # block-selection threshold; higher = stricter = fewer K-blocks per Q-row
DFLASH_FP_PROFILE=1   # log per-stage timings (mean_K / score / select / forward)
```

keep_ratio=0.05 is the default. 0.02 cuts target prefill from ~10 s to ~3 s but starts losing the needle. DFLASH_FP_ALPHA=0.99 cuts ~1 s at 128K with a small NIAH-margin loss. Calibration territory.
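For a sense of scale, keep_ratio maps directly onto how many tokens the 27B target actually has to prefill (the ~6.5K survivors mentioned earlier); a quick check:

```cpp
#include <cstdio>

int main() {
    const int prompt_tokens = 131072;
    // Survivor count is just keep_ratio * prompt length; target prefill cost
    // tracks this, which is why 0.02 is so much cheaper than 0.05.
    for (double keep_ratio : {0.02, 0.05, 0.10})
        std::printf("keep_ratio %.2f -> ~%d tokens for the 27B target to prefill\n",
                    keep_ratio, (int)(keep_ratio * prompt_tokens));
}
```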

Any feedback is more than welcome!
