Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup on NVIDIA Hopper GPUs

MarkTechPost / 4/30/2026

Key Points

  • The QwenLM team released FlashQLA, a new high-performance kernel library focused on accelerating Gated Delta Network (GDN) Chunked Prefill for both forward and backward passes.
  • FlashQLA is designed for large-scale pretraining workloads and edge-side agentic inference scenarios.
  • The library reportedly delivers up to a 3× speedup on NVIDIA Hopper GPUs, indicating strong optimization for modern NVIDIA hardware.
  • By improving core attention-related compute, FlashQLA can reduce training and inference latency/cost for systems using the targeted GDN Chunked Prefill approach.
  • The release expands the available developer tooling around linear attention kernels, potentially making it easier to deploy faster attention implementations in production pipelines.

The QwenLM team has released FlashQLA, a new kernel library that accelerates the forward and backward passes of Gated Delta Network (GDN) Chunked Prefill, with reported speedups of up to 3× on NVIDIA Hopper GPUs. The library targets both large-scale pretraining workloads and edge-side agentic inference scenarios.
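For context on what these kernels compute: GDN layers maintain a "fast-weight" state that is updated token by token with a gated delta rule, and a chunked-prefill kernel evaluates that recurrence over fixed-size chunks using matrix multiplies instead of a pure sequential scan. The sketch below is a minimal, unoptimized NumPy reference for the sequential recurrence only; it is not FlashQLA's API, and the gate shapes and exact parameterization are assumptions for illustration.

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive sequential reference for a gated delta rule (illustrative sketch;
    the exact GDN parameterization in Qwen's models may differ).

    q, k: (T, d_k) queries/keys; v: (T, d_v) values;
    alpha, beta: (T,) per-token gates in (0, 1).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))   # fast-weight state carried across tokens
    o = np.empty((T, d_v))
    for t in range(T):
        pred = S.T @ k[t]      # what the current state predicts for key k_t
        # Gated delta update: decay the state, then correct it toward v_t.
        S = alpha[t] * S + beta[t] * np.outer(k[t], v[t] - pred)
        o[t] = S.T @ q[t]      # read out with the query
    return o

# Toy usage with random inputs.
rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
out = gated_delta_rule(rng.standard_normal((T, d_k)),
                       rng.standard_normal((T, d_k)),
                       rng.standard_normal((T, d_v)),
                       rng.uniform(0.9, 1.0, T),
                       rng.uniform(0.1, 0.5, T))
print(out.shape)  # (8, 4)
```

The per-token loop above is memory-bound and serial; chunked formulations of this recurrence (the approach FlashQLA optimizes) batch the work into chunk-level matmuls that map far better onto GPU tensor cores, which is where hardware-specific kernel tuning pays off.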