Why Attend to Everything? Focus is the Key

arXiv cs.CL / 4/7/2026


Key Points

  • The paper introduces Focus, an additive attention method that learns which token pairs matter via learnable centroids, rather than approximating attention over all pairs.
  • Focus freezes all model weights and trains only centroid parameters (as few as ~148K), improving domain perplexity without degrading downstream benchmark performance across model sizes up to the 70B scale.
  • At inference, Focus discretizes routing via top-k group selection, producing hard sparsity that yields about 2× speedup while improving perplexity versus the pretrained baseline.
  • The authors report an 8.6× wall-clock speedup at 1M tokens by decomposing the routing pattern into two standard FlashAttention calls, avoiding custom kernels.
  • Focus is claimed to preserve instruction alignment better than LoRA (higher TruthfulQA retention) and uses Sinkhorn normalization to enforce balanced, interpretable linguistic groupings without supervision.
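The balanced grouping mentioned in the last bullet can be illustrated with a toy Sinkhorn routine. This is a minimal sketch, not the paper's implementation: the dimensions, iteration count, and the `sinkhorn_balance` helper are illustrative assumptions, and the centroids are random rather than learned.

```python
import numpy as np

def sinkhorn_balance(logits, n_iters=50):
    """Scale exp(logits) toward balanced token-to-group assignments:
    each token's row sums to 1, each group's column receives n/k mass."""
    n, k = logits.shape
    P = np.exp(logits - logits.max())                  # positive kernel
    for _ in range(n_iters):
        P *= (1.0 / n) / P.sum(axis=1, keepdims=True)  # rows -> 1/n
        P *= (1.0 / k) / P.sum(axis=0, keepdims=True)  # cols -> 1/k
    return P * n  # rescale so each row is a distribution over groups

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))             # 8 toy token embeddings, d=4
centroids = rng.normal(size=(3, 4))          # 3 centroids (random stand-ins)
P = sinkhorn_balance(tokens @ centroids.T)   # soft, balanced routing
groups = P.argmax(axis=1)                    # hard assignment at inference
```

Because the column constraint forces every group to receive roughly n/k tokens, no centroid can collapse to zero usage, which is the "balanced groups as a hard constraint" property the authors describe.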

Abstract

We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks, from 124M to 70B parameters and across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding a 2× speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches an 8.6× wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
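The hybrid sparsity the abstract describes (a full-resolution local window plus same-group distant pairs) amounts to a boolean attention mask. A minimal sketch under assumed toy group assignments; the `focus_mask` helper and the window size are hypothetical, not from the paper:

```python
import numpy as np

def focus_mask(groups, window):
    """Causal attention mask: a local band at full resolution,
    with distant pairs allowed only within the same group."""
    n = len(groups)
    idx = np.arange(n)
    local = np.abs(idx[:, None] - idx[None, :]) < window  # local window
    same_group = groups[:, None] == groups[None, :]       # same-group pairs
    causal = idx[:, None] >= idx[None, :]                 # no future tokens
    return (local | same_group) & causal

groups = np.array([0, 1, 0, 2, 1, 0, 2, 1])  # toy hard assignments
mask = focus_mask(groups, window=2)
# Fraction of causal pairs actually attended -- the source of the speedup:
density = mask.sum() / (len(groups) * (len(groups) + 1) // 2)
```

The sketch only materializes the mask; the paper's reported 8.6× speedup comes from never materializing it, instead decomposing the pattern into two standard FlashAttention calls (one over the local band, one over gathered same-group tokens).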