Exclusive Self Attention
Apple Machine Learning Journal / 3/25/2026
Key Points
- The article proposes a new Transformer attention variant called Exclusive Self Attention (XSA), which modifies standard self attention (SA) to improve sequence modeling performance.
- XSA constrains attention to focus on information orthogonal to a token’s own value vector, aiming to exclude self-position information while strengthening contextual modeling.
- Experiments on standard language modeling show XSA consistently outperforms SA across model sizes up to 2.7B parameters.
- The reported performance gains increase with longer sequence lengths, suggesting XSA is especially beneficial in long-context settings.
We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves the Transformer's sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token's own value vector (thus excluding information from its own position), encouraging better context modeling. Evaluated on the standard language modeling task, XSA consistently outperforms SA across model sizes up to 2.7B parameters and shows increasingly larger gains as sequence length grows.
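The abstract does not spell out the mechanism, but one natural reading of "orthogonal to the token's own value vector" is that each token's attention output has its projection onto its own value vector removed. The sketch below implements that assumed reading on top of standard scaled dot-product attention; the function name and the projection step are illustrative assumptions, not the paper's confirmed implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def exclusive_self_attention(q, k, v):
    """Sketch of XSA (assumed reading): compute standard self attention,
    then subtract each output's projection onto the token's own value
    vector, leaving only the component orthogonal to v_i."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # (T, T) attention logits
    attn = softmax(scores, axis=-1) @ v  # (T, d) standard SA output
    # Projection coefficient <attn_i, v_i> / <v_i, v_i> per token.
    coef = (attn * v).sum(-1, keepdims=True) / ((v * v).sum(-1, keepdims=True) + 1e-9)
    return attn - coef * v               # orthogonal to each token's own v_i
```

By construction, the returned rows satisfy `out_i · v_i ≈ 0`, which is the "exclusion" property described in the abstract: the output can no longer restate information already carried by the token's own value vector.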