Hey fellow Llamas, your time is precious, so I'll keep it short. We built a GGUF port of DFlash speculative decoding: a standalone C++/CUDA stack on top of ggml that runs on a single 24 GB RTX 3090 and hosts the new Qwen3.6-27B. We call it Luce DFlash (https://github.com/Luce-Org/lucebox-hub; MIT). It gets a ~1.98x mean speedup over autoregressive on Qwen3.6 across HumanEval / GSM8K / Math500, with zero retraining (z-lab published a matched Qwen3.6-DFlash draft on 2026-04-26, still under training, so AL should keep climbing).

If you have CUDA 12+ and an NVIDIA GPU (RTX 3090 / 4090 / 5090, DGX Spark, other Blackwell, or Jetson AGX Thor with CUDA 13+), all you need is:

# After cloning the repo (link in the first comment):
# Fetch target (~16 GB)
# Matched 3.6 draft is gated: accept terms + set HF_TOKEN first
# Run
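In case it helps, here is a minimal sketch of what those steps usually look like with huggingface-cli and a plain CMake build. The repo IDs, file names, binary name, and flags below are placeholders rather than the project's actual commands, so check the README for the real ones:

```bash
# Placeholder repo IDs, file names, and binary name -- see the project's README for the real commands.
# Fetch the ~16 GB UD-Q4_K_XL target
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-UD-Q4_K_XL.gguf --local-dir models

# The matched 3.6 draft is gated: accept the terms on Hugging Face, then authenticate
export HF_TOKEN=hf_your_token_here
huggingface-cli download z-lab/Qwen3.6-DFlash-draft --local-dir models/draft

# Build (needs a CUDA 12+ toolchain)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run
./build/bin/luce-dflash --target models/Qwen3.6-27B-UD-Q4_K_XL.gguf --draft models/draft
```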
That's it. No Python runtime in the engine, no llama.cpp install, no vLLM, no SGLang. The binary links libggml*.a and never libllama. Luce DFlash will compress the KV cache (TQ3_0) so up to 256K of context fits in the 24 GB card, batch prefill more aggressively (from 16 up to 192 for long prompts), use sliding-window flash attention to keep decode speed up at long context, and serve either an OpenAI-compatible HTTP endpoint or a local chat REPL.
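For intuition on why a quantized, windowed KV cache is what makes long context fit, here is a back-of-the-envelope bound; it is not the project's published math, and Qwen3.6's layer and head dimensions are not given in the post:

$$\text{KV bytes} \;\approx\; 2 \cdot n_{\text{layers}} \cdot n_{\text{kv-heads}} \cdot d_{\text{head}} \cdot \min(W, L) \cdot \frac{b}{8}$$

where $L$ is the context length, $W$ the attention window for sliding-window layers (for full-attention layers $W = L$), and $b$ the bits per cached element: 16 for F16, roughly 4.5 for Q4_0. Dropping $b$ from 16 to ~4.5 cuts KV memory by about 3.5x, and windowed layers stop growing with $L$ entirely, which is why an F16 full cache is the first thing that stops fitting at long context.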
Measured on an RTX 3090 with a Qwen3.6-27B UD-Q4_K_XL (unsloth Dynamic 2.0) target, 10 prompts per dataset, n_gen=256, the speedup is real on consumer hardware, not a paper number. The target graph produces bit-identical output to autoregressive in AR mode; the draft graph matches the z-lab PyTorch reference at cosine similarity 0.999812. Q4_0 KV costs ~3% AL at short context (8.56 to 8.33) and wins at long context, where F16 won't fit anyway. Constraints: CUDA only, greedy verify only (temperature/top_p on the OpenAI server are accepted and ignored), no Metal / ROCm / multi-GPU. The repo started single-3090; recent community PRs added support for the RTX 5090, DGX Spark / GB10, other Blackwell cards, and Jetson AGX Thor (sm_110 + CUDA 13). Feedback more than welcome!
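If you want to point existing OpenAI-client tooling at it, a request would look roughly like the following. The host, port, and model name are assumptions (the post only says the server speaks the OpenAI API), and per the constraints above temperature/top_p are accepted but ignored because verification is greedy:

```bash
# Assumed host, port, and model name; only the OpenAI-compatible request shape is stated in the post.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-27b-dflash",
        "messages": [{"role": "user", "content": "Write a prime sieve in C++."}],
        "temperature": 0.7,
        "top_p": 0.9
      }'
```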
Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090
Reddit r/LocalLLaMA / 4/28/2026
Key Points
- Luce DFlash is a GGUF-based, standalone C++/CUDA speculative decoding implementation that runs Qwen3.6-27B on a single 24GB RTX 3090 and claims up to about 2× throughput.
- It reports a ~1.98× mean speedup on Qwen3.6 benchmarks (HumanEval, GSM8K, Math500) with no retraining, using z-lab's matched Qwen3.6-DFlash draft.
- It optimizes inference by compressing the KV cache (TQ3_0), enabling long contexts (up to 256K within 24 GB), raising prefill batching from 16 to 192 for long prompts, and using sliding-window flash attention to maintain decode speed at large context lengths.
- Users can build and run it via provided CMake + Hugging Face download commands, and it can be served through an OpenAI-compatible HTTP endpoint or a local chat REPL.