Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

arXiv cs.CL / 4/10/2026


Key Points

  • The paper addresses the scalability bottleneck of quadratic-complexity attention in long-context LLM inference by proposing context-aware hybrid attention rather than static FA/SA mixing ratios.
  • Flux Attention introduces a lightweight Layer Router that dynamically selects, at the layer level, whether each layer uses Full Attention or Sparse Attention based on the current input context.
  • The method targets hardware efficiency issues seen in head-level dynamic sparsity by aiming for contiguous memory access and reducing load imbalance during autoregressive decoding.
  • It claims practical wall-clock speedups (up to 2.8× for prefill and 2.0× for decode) while maintaining strong performance on long-context and mathematical reasoning benchmarks.
  • The framework is described as parameter-efficient, requiring only about 12 hours of training on 8×A800 GPUs while keeping the underlying pretrained LLM weights frozen.

Abstract

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8×A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to 2.8× and 2.0× in the prefill and decode stages, respectively.
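To make the layer-level routing idea concrete, here is a minimal numpy sketch, not the paper's implementation: a hypothetical `LayerRouter` (a linear probe on the mean-pooled hidden state, an assumption about what "lightweight" means) picks, per input, whether a layer runs full O(n²) attention or a sliding-window sparse variant (one common SA pattern, also assumed here). All names and the gating rule are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Standard O(n^2) scaled dot-product attention over all positions.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sparse_attention(q, k, v, window=4):
    # Sliding-window sparse attention: each query attends only to the
    # previous `window` positions (an assumed SA instantiation).
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        s = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        out[i] = softmax(s) @ v[lo:i + 1]
    return out

class LayerRouter:
    """Hypothetical lightweight router: a linear probe on the mean-pooled
    hidden state yields a context-dependent score that selects FA or SA
    for this layer on this input."""
    def __init__(self, d_model, rng):
        self.w = rng.standard_normal(d_model) / np.sqrt(d_model)

    def route(self, h):
        score = float(np.mean(h, axis=0) @ self.w)
        return "FA" if score > 0.0 else "SA"

def flux_layer(h, router, window=4):
    # Route the whole layer (not individual heads) to one attention kernel,
    # so memory access stays contiguous for the chosen branch.
    choice = router.route(h)
    if choice == "FA":
        return full_attention(h, h, h), choice
    return sparse_attention(h, h, h, window), choice
```

Because the decision is made once per layer per input, every head in that layer executes the same kernel, which is what avoids the head-level load imbalance and synchronization long-tails the abstract describes.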
