Qwen Introduces FlashQLA

Reddit r/LocalLLaMA / 4/29/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • Qwen introduced FlashQLA, a set of high-performance linear attention kernels built on TileLang, aimed at improving efficiency for agentic AI on personal/edge devices.
  • The approach delivers reported speedups of 2–3× for the forward pass and about 2× for the backward pass, with the strongest gains in tensor-parallel (TP) setups, for small models, and on long-context workloads.
  • FlashQLA uses gate-driven automatic intra-card (intra-device) context parallelism (CP) and a hardware-friendly algebraic reformulation to boost SM (streaming multiprocessor) utilization.
  • Instead of fully fusing the entire GDN (Gated DeltaNet) flow into one kernel, it splits the work into two kernels optimized for CP and backward efficiency, trading some extra memory I/O at large batch sizes for better real-world edge performance.
  • The backward pass was engineered as a 16-stage warp-specialized pipeline under tight on-chip memory constraints, reaching 2×+ kernel-level speedups; Qwen provides a blog post and a GitHub code release.

Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

2–3× forward speedup. 2× backward speedup.

💻 Purpose-built for agentic AI on your personal devices.

Key insights:

  1. Gate-driven automatic intra-card CP.

  2. Hardware-friendly algebraic reformulation.

  3. TileLang fused warp-specialized kernels.
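
For readers who haven't worked with linear attention: kernels like these operate on a fixed-size state recurrence rather than the usual T×T softmax map. Below is a minimal numpy sketch of the standard gated form; the scalar decay gate is an assumption for illustration, since the post doesn't spell out GDN's exact gating or normalization.

```python
import numpy as np

def gated_linear_attention(q, k, v, g):
    """Reference (sequential) form of the gated linear-attention recurrence:

        S_t = g_t * S_{t-1} + k_t v_t^T,    o_t = q_t S_t

    q, k: (T, d_k); v: (T, d_v); g: (T,) per-step decay gates in (0, 1).
    A minimal sketch of the family of recurrences kernels like FlashQLA
    accelerate; GDN's actual gating differs in detail.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = g[t] * S + np.outer(k[t], v[t])   # decay old state, add rank-1 update
        out[t] = q[t] @ S                     # read out against current state
    return out
```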

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.
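
What makes intra-device CP possible here is that the gated recurrence is decomposable: any chunk of timesteps reduces to a (gate product, local state) summary, and summaries compose. A sketch of that property, under the same simplified scalar gating as above (not FlashQLA's actual gate-driven chunking):

```python
import numpy as np

def chunk_summary(k, v, g):
    """Reduce one chunk of the recurrence S_t = g_t*S_{t-1} + k_t v_t^T to a
    pair (P, A) such that a state S entering the chunk leaves it as P*S + A.
    Summaries for different chunks are independent, so they can be built in
    parallel (one chunk per SM, say) and stitched with a short sequential
    scan afterwards -- the property behind intra-device context parallelism.
    """
    A = np.zeros((k.shape[1], v.shape[1]))
    P = 1.0
    for t in range(len(g)):
        A = g[t] * A + np.outer(k[t], v[t])
        P *= g[t]
    return P, A

# Quick check that summaries compose (split an 8-step sequence at t = 5):
rng = np.random.default_rng(0)
T, dk, dv = 8, 4, 3
k_, v_ = rng.standard_normal((T, dk)), rng.standard_normal((T, dv))
g_ = rng.uniform(0.9, 1.0, T)
P1, A1 = chunk_summary(k_[:5], v_[:5], g_[:5])
P2, A2 = chunk_summary(k_[5:], v_[5:], g_[5:])
Pf, Af = chunk_summary(k_, v_, g_)
assert np.isclose(Pf, P2 * P1) and np.allclose(Af, P2 * A1 + A2)
```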

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.
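
To make that trade-off concrete, here is one plausible shape such a two-pass split could take, with the per-chunk summaries round-tripping through memory between launches. The kernel boundaries below are a guess for illustration, not FlashQLA's actual decomposition:

```python
import numpy as np

def gla_two_pass(q, k, v, g, chunk=4):
    """Two-launch sketch of a split like the one the post describes. Pass 1
    handles each chunk with a zero initial state and writes per-chunk
    summaries out -- the extra memory traffic a fully fused kernel would
    avoid. Pass 2 re-reads them and adds each chunk's carried-in term.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    n = -(-T // chunk)                       # number of chunks (ceil div)
    out = np.zeros((T, d_v))
    prefix = np.empty(T)                     # running gate product in chunk
    P = np.ones(n)                           # per-chunk gate products
    A = np.zeros((n, d_k, d_v))              # per-chunk local end states

    # Pass 1 ("kernel" 1): parallel over chunks; writes P, A, prefix out.
    for c in range(n):
        S, p = np.zeros((d_k, d_v)), 1.0
        for t in range(c * chunk, min((c + 1) * chunk, T)):
            p *= g[t]
            prefix[t] = p
            S = g[t] * S + np.outer(k[t], v[t])
            out[t] = q[t] @ S                # chunk-local part of o_t
        P[c], A[c] = p, S

    # Cheap sequential scan over chunk states (n items, not T timesteps).
    carry = np.zeros((n, d_k, d_v))
    S = np.zeros((d_k, d_v))
    for c in range(n):
        carry[c] = S
        S = P[c] * S + A[c]

    # Pass 2 ("kernel" 2): parallel over chunks; reads carry back in.
    # Uses S_t = prefix[t] * carry[c] + S_t_local, so the missing output
    # term is prefix[t] * (q_t @ carry[c]).
    for c in range(n):
        for t in range(c * chunk, min((c + 1) * chunk, T)):
            out[t] += prefix[t] * (q[t] @ carry[c])
    return out
```

The intermediate arrays (P, A, prefix, carry) are exactly the kind of state a fully fused kernel would keep on-chip, which is where the extra I/O at large batch sizes comes from.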

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.
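
For intuition about what a deep pipeline buys, and why on-chip memory is the constraint, here is a toy scheduling model in plain Python. It models only issue order, not real overlap, and says nothing about FlashQLA's actual stage assignment:

```python
from collections import deque

def pipelined(tiles, load, compute, depth=16):
    """Toy model of a depth-stage software pipeline. In a warp-specialized
    kernel, producer warps issue loads for up to `depth` tiles ahead while
    consumer warps compute on earlier ones, hiding memory latency. Each
    in-flight tile needs its own on-chip buffer, which is why 16 stages
    under a tight shared-memory budget is hard.
    """
    in_flight = deque()
    results = []
    for tile in tiles:
        in_flight.append(load(tile))         # producer: prefetch ahead
        if len(in_flight) == depth:          # buffers full -> consume oldest
            results.append(compute(in_flight.popleft()))
    while in_flight:                         # epilogue: drain the pipeline
        results.append(compute(in_flight.popleft()))
    return results

# e.g. pipelined(range(64), load=lambda t: t, compute=lambda x: x * x)
```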

We hope this is useful to the community!

Learn more:

📖 Blog: https://qwen.ai/blog?id=flashqla

💻 Code: https://github.com/QwenLM/FlashQLA

submitted by /u/ResearchCrafty1804