Qwen Introduces FlashQLA

Reddit r/LocalLLaMA / 4/29/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • Qwen introduced FlashQLA, a set of high-performance linear attention kernels built on TileLang, aimed at improving efficiency for agentic AI on personal/edge devices.
  • The approach delivers reported speedups of 2–3× for the forward pass and about 2× for the backward pass, with the strongest gains in tensor-parallel (TP) setups, for small models, and on long-context workloads.
  • FlashQLA uses gate-driven automatic intra-card (intra-device) context parallelism (CP) and a hardware-friendly algebraic reformulation to boost SM (streaming multiprocessor) utilization.
  • Instead of fully fusing the entire GDN (Gated DeltaNet) flow into one kernel, it splits the work into two kernels optimized for CP and backward efficiency, trading some extra memory I/O at large batch sizes for better real-world edge performance.
  • The backward pass was engineered as a 16-stage warp-specialized pipeline under tight on-chip memory constraints, reaching 2×+ kernel-level speedups; Qwen provides a blog post and a GitHub code release.

Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

2–3× forward speedup. 2× backward speedup.

💻 Purpose-built for agentic AI on your personal devices.

Key insights:

  1. Gate-driven automatic intra-card CP.

  2. Hardware-friendly algebraic reformulation.

  3. TileLang fused warp-specialized kernels.
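
For readers who haven't worked with linear attention: kernels like these operate on a fixed-size state recurrence rather than the usual T×T softmax map. Below is a minimal numpy sketch of the standard gated form; the scalar decay gate is an assumption for illustration, since the post doesn't spell out GDN's exact gating or normalization.

```python
import numpy as np

def gated_linear_attention(q, k, v, g):
    """Reference (sequential) form of the gated linear-attention recurrence:

        S_t = g_t * S_{t-1} + k_t v_t^T,    o_t = q_t S_t

    q, k: (T, d_k); v: (T, d_v); g: (T,) per-step decay gates in (0, 1).
    A minimal sketch of the family of recurrences kernels like FlashQLA
    accelerate; GDN's actual gating differs in detail.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = g[t] * S + np.outer(k[t], v[t])   # decay old state, add rank-1 update
        out[t] = q[t] @ S                     # read out against current state
    return out
```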

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.
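
What makes intra-device CP possible here is that the gated recurrence is decomposable: any chunk of timesteps reduces to a (gate product, local state) summary, and summaries compose. A sketch of that property, under the same simplified scalar gating as above (not FlashQLA's actual gate-driven chunking):

```python
import numpy as np

def chunk_summary(k, v, g):
    """Reduce one chunk of the recurrence S_t = g_t*S_{t-1} + k_t v_t^T to a
    pair (P, A) such that a state S entering the chunk leaves it as P*S + A.
    Summaries for different chunks are independent, so they can be built in
    parallel (one chunk per SM, say) and stitched with a short sequential
    scan afterwards -- the property behind intra-device context parallelism.
    """
    A = np.zeros((k.shape[1], v.shape[1]))
    P = 1.0
    for t in range(len(g)):
        A = g[t] * A + np.outer(k[t], v[t])
        P *= g[t]
    return P, A

# Quick check that summaries compose (split an 8-step sequence at t = 5):
rng = np.random.default_rng(0)
T, dk, dv = 8, 4, 3
k_, v_ = rng.standard_normal((T, dk)), rng.standard_normal((T, dv))
g_ = rng.uniform(0.9, 1.0, T)
P1, A1 = chunk_summary(k_[:5], v_[:5], g_[:5])
P2, A2 = chunk_summary(k_[5:], v_[5:], g_[5:])
Pf, Af = chunk_summary(k_, v_, g_)
assert np.isclose(Pf, P2 * P1) and np.allclose(Af, P2 * A1 + A2)
```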

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.
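
To make that trade-off concrete, here is one plausible shape such a two-pass split could take, with the per-chunk summaries round-tripping through memory between launches. The kernel boundaries below are a guess for illustration, not FlashQLA's actual decomposition:

```python
import numpy as np

def gla_two_pass(q, k, v, g, chunk=4):
    """Two-launch sketch of a split like the one the post describes. Pass 1
    handles each chunk with a zero initial state and writes per-chunk
    summaries out -- the extra memory traffic a fully fused kernel would
    avoid. Pass 2 re-reads them and adds each chunk's carried-in term.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    n = -(-T // chunk)                       # number of chunks (ceil div)
    out = np.zeros((T, d_v))
    prefix = np.empty(T)                     # running gate product in chunk
    P = np.ones(n)                           # per-chunk gate products
    A = np.zeros((n, d_k, d_v))              # per-chunk local end states

    # Pass 1 ("kernel" 1): parallel over chunks; writes P, A, prefix out.
    for c in range(n):
        S, p = np.zeros((d_k, d_v)), 1.0
        for t in range(c * chunk, min((c + 1) * chunk, T)):
            p *= g[t]
            prefix[t] = p
            S = g[t] * S + np.outer(k[t], v[t])
            out[t] = q[t] @ S                # chunk-local part of o_t
        P[c], A[c] = p, S

    # Cheap sequential scan over chunk states (n items, not T timesteps).
    carry = np.zeros((n, d_k, d_v))
    S = np.zeros((d_k, d_v))
    for c in range(n):
        carry[c] = S
        S = P[c] * S + A[c]

    # Pass 2 ("kernel" 2): parallel over chunks; reads carry back in.
    # Uses S_t = prefix[t] * carry[c] + S_t_local, so the missing output
    # term is prefix[t] * (q_t @ carry[c]).
    for c in range(n):
        for t in range(c * chunk, min((c + 1) * chunk, T)):
            out[t] += prefix[t] * (q[t] @ carry[c])
    return out
```

The intermediate arrays (P, A, prefix, carry) are exactly the kind of state a fully fused kernel would keep on-chip, which is where the extra I/O at large batch sizes comes from.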

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.
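
For intuition about what a deep pipeline buys, and why on-chip memory is the constraint, here is a toy scheduling model in plain Python. It models only issue order, not real overlap, and says nothing about FlashQLA's actual stage assignment:

```python
from collections import deque

def pipelined(tiles, load, compute, depth=16):
    """Toy model of a depth-stage software pipeline. In a warp-specialized
    kernel, producer warps issue loads for up to `depth` tiles ahead while
    consumer warps compute on earlier ones, hiding memory latency. Each
    in-flight tile needs its own on-chip buffer, which is why 16 stages
    under a tight shared-memory budget is hard.
    """
    in_flight = deque()
    results = []
    for tile in tiles:
        in_flight.append(load(tile))         # producer: prefetch ahead
        if len(in_flight) == depth:          # buffers full -> consume oldest
            results.append(compute(in_flight.popleft()))
    while in_flight:                         # epilogue: drain the pipeline
        results.append(compute(in_flight.popleft()))
    return results

# e.g. pipelined(range(64), load=lambda t: t, compute=lambda x: x * x)
```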

We hope this is useful to the community!

Learn more:

📖 Blog: https://qwen.ai/blog?id=flashqla

💻 Code: https://github.com/QwenLM/FlashQLA

submitted by /u/ResearchCrafty1804