Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

arXiv cs.CL / 4/10/2026


Key Points

  • The paper addresses the scalability bottleneck of quadratic-complexity attention in long-context LLM inference by proposing context-aware hybrid attention rather than static FA/SA mixing ratios.
  • Flux Attention introduces a lightweight Layer Router that dynamically selects, at the layer level, whether each layer uses Full Attention or Sparse Attention based on the current input context.
  • The method targets hardware efficiency issues seen in head-level dynamic sparsity by aiming for contiguous memory access and reducing load imbalance during autoregressive decoding.
  • It claims practical wall-clock speedups (up to 2.8× for prefill and 2.0× for decode) while maintaining strong performance on long-context and mathematical reasoning benchmarks.
  • The framework is described as parameter-efficient, requiring only about 12 hours of training on 8×A800 GPUs while keeping the underlying pretrained LLM weights frozen.

Abstract

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8×A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to 2.8× and 2.0× in the prefill and decode stages, respectively.
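To make the layer-level routing idea concrete, here is a minimal numpy sketch, not the paper's implementation: a hypothetical `LayerRouter` (a linear probe on the mean-pooled hidden state, an assumption about what "lightweight" means) picks, per input, whether a layer runs full O(n²) attention or a sliding-window sparse variant (one common SA pattern, also assumed here). All names and the gating rule are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard O(n^2) scaled dot-product attention over all positions.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sparse_attention(q, k, v, window=4):
    # Sliding-window sparse attention: each query attends only to the
    # previous `window` positions (an assumed SA instantiation).
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        s = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        out[i] = softmax(s) @ v[lo:i + 1]
    return out

class LayerRouter:
    """Hypothetical lightweight router: a linear probe on the mean-pooled
    hidden state yields a context-dependent score that selects FA or SA
    for this layer on this input."""
    def __init__(self, d_model, rng):
        self.w = rng.standard_normal(d_model) / np.sqrt(d_model)

    def route(self, h):
        score = float(np.mean(h, axis=0) @ self.w)
        return "FA" if score > 0.0 else "SA"

def flux_layer(h, router, window=4):
    # Route the whole layer (not individual heads) to one attention kernel,
    # so memory access stays contiguous for the chosen branch.
    choice = router.route(h)
    if choice == "FA":
        return full_attention(h, h, h), choice
    return sparse_attention(h, h, h, window), choice
```

Because the decision is made once per layer per input, every head in that layer executes the same kernel, which is what avoids the head-level load imbalance and synchronization long-tails the abstract describes.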
