Exclusive Self Attention

Apple Machine Learning Journal / 3/25/2026


Key Points

  • The article proposes a new Transformer attention variant called Exclusive Self Attention (XSA), which modifies standard self attention (SA) to improve sequence modeling performance.
  • XSA constrains attention to focus on information orthogonal to a token’s own value vector, aiming to exclude self-position information while strengthening contextual modeling.
  • Experiments on standard language modeling show XSA consistently outperforms SA across model sizes up to 2.7B parameters.
  • The reported performance gains increase with longer sequence lengths, suggesting XSA is especially beneficial in long-context settings.

We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves the Transformer's sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token's own value vector (thus excluding information from the token's own position), encouraging better context modeling. Evaluated on the standard language modeling task, XSA consistently outperforms SA across model sizes up to 2.7B parameters and shows increasingly larger gains as sequence length grows.
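
The article does not spell out the exact formulation, but the orthogonality constraint can be sketched as follows: compute standard scaled dot-product attention, then project each position's output onto the subspace orthogonal to that position's own value vector, so the output carries no component along v_i. The function name and the projection step are illustrative assumptions, not the paper's definitive implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def exclusive_self_attention(Q, K, V):
    """Sketch of the XSA idea (assumed formulation): standard attention
    whose output at position i is projected orthogonal to V[i]."""
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (T, T) attention weights
    out = attn @ V                        # standard SA output, (T, d)
    # Remove the component parallel to each token's own value vector,
    # leaving only information orthogonal to v_i.
    v_unit = V / (np.linalg.norm(V, axis=-1, keepdims=True) + 1e-9)
    out = out - (out * v_unit).sum(-1, keepdims=True) * v_unit
    return out
```

After this projection, the dot product between each output vector and the corresponding value vector is zero, which is one concrete way to "exclude information of self position" as described above.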