Why Attend to Everything? Focus is the Key
arXiv cs.CL / 4/7/2026
Key Points
- The paper introduces Focus, an additive attention method that learns which token pairs matter via learnable centroids, rather than approximating full all-pairs attention.
- Focus freezes all model weights and trains only centroid parameters (as few as ~148K), improving domain perplexity without degrading downstream benchmark performance across model sizes up to the 70B scale.
- At inference, Focus discretizes routing via top-k group selection, producing hard sparsity that yields about 2× speedup while improving perplexity versus the pretrained baseline.
- The authors report an 8.6× wall-clock speedup at 1M tokens by decomposing the routing pattern into two standard FlashAttention calls, avoiding custom kernels.
- The authors claim Focus preserves instruction alignment better than LoRA (higher TruthfulQA retention), and that Sinkhorn normalization enforces balanced, interpretable linguistic groupings without supervision.
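To make the routing mechanism in the points above concrete, here is a minimal NumPy sketch of centroid-based token routing with Sinkhorn balancing and hard top-k group selection. This is an illustrative reconstruction from the summary, not the paper's implementation: the function names, the centroid/group dimensions, and the iteration count are all assumptions.

```python
import numpy as np

def sinkhorn(logits, n_iters=10):
    """Sinkhorn normalization (assumed form): alternate row/column
    normalization so the soft assignment matrix is balanced across groups."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax numerator
    for _ in range(n_iters):
        p /= p.sum(axis=1, keepdims=True)  # each token's group weights sum to 1
        p /= p.sum(axis=0, keepdims=True)  # balance total load across groups
    return p / p.sum(axis=1, keepdims=True)

def route_tokens(tokens, centroids, k=2, hard=False):
    """Score tokens against learnable centroids. Soft routing during training;
    hard top-k group selection (sparsity) at inference, as the summary describes."""
    logits = tokens @ centroids.T  # (n_tokens, n_groups) affinities
    probs = sinkhorn(logits)
    if not hard:
        return probs
    # Inference: keep each token's top-k groups, zero out the rest.
    topk = np.argpartition(-probs, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, topk, 1.0, axis=1)
    return mask

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))      # 8 token embeddings, dim 16
centroids = rng.normal(size=(4, 16))   # only these parameters would be trained
soft = route_tokens(tokens, centroids)                 # balanced soft assignments
hard = route_tokens(tokens, centroids, k=2, hard=True) # hard sparsity mask
```

The key idea the summary highlights is that only `centroids` carries trainable parameters, which is how the method stays as small as ~148K parameters while the frozen model provides the token representations.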