Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
arXiv cs.LG / 4/22/2026
Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper addresses the limitation that standard bidirectional transformers are permutation-invariant unless explicit positional embeddings are added, unlike unidirectional attention, which encodes order through its causal triangular mask.
- It proposes Dual Triangle Attention, which splits each attention head’s query–key subspace in two and applies complementary triangular masks, letting one half attend to past-and-self and the other to future-and-self, preserving bidirectional context while retaining an implicit positional bias.
- The method is implemented in PyTorch with flex_attention as a single compiled kernel call and adds no learned parameters beyond standard multi-head attention (see the sketch after this list).
- Experiments on an argmax positional probe and masked language modeling for both natural language and protein sequences show Dual Triangle Attention can learn positional information without explicit positional embeddings, and performs strongly when combined with RoPE.
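To make the masking scheme concrete, here is a minimal PyTorch sketch of how such a dual-triangle split could be expressed with flex_attention. It assumes the subspace split is realized by reshaping each head into two half-width sub-heads, one per triangle; the function names (`dual_triangle_attention`, `split_subheads`, `merge_subheads`) and that reshaping choice are illustrative assumptions, not the paper's reference code.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask


def dual_triangle_mask(b, h, q_idx, kv_idx):
    # Even sub-heads attend to past-and-self (lower triangle);
    # odd sub-heads attend to future-and-self (upper triangle).
    return ((h % 2 == 0) & (q_idx >= kv_idx)) | ((h % 2 == 1) & (q_idx <= kv_idx))


def split_subheads(t):
    # (B, H, L, D) -> (B, 2H, L, D/2): each head's channels become two
    # half-width sub-heads, one per triangular mask.
    B, H, L, D = t.shape
    assert D % 2 == 0, "head dimension must be even to split in two"
    return t.reshape(B, H, L, 2, D // 2).permute(0, 1, 3, 2, 4).reshape(B, 2 * H, L, D // 2)


def merge_subheads(t):
    # Inverse of split_subheads: (B, 2H, L, D/2) -> (B, H, L, D).
    B, H2, L, Dh = t.shape
    return t.reshape(B, H2 // 2, 2, L, Dh).permute(0, 1, 3, 2, 4).reshape(B, H2 // 2, L, 2 * Dh)


def dual_triangle_attention(q, k, v):
    # q, k, v: (B, H, L, D). Both triangles are handled by one
    # flex_attention call over the doubled set of sub-heads.
    B, H, L, _ = q.shape
    block_mask = create_block_mask(
        dual_triangle_mask, B=None, H=2 * H, Q_LEN=L, KV_LEN=L, device=str(q.device)
    )
    out = flex_attention(
        split_subheads(q), split_subheads(k), split_subheads(v), block_mask=block_mask
    )
    return merge_subheads(out)
```

Wrapping the call in torch.compile(dual_triangle_attention) would fuse it into a single compiled kernel, consistent with the paper's single-kernel-call claim; the sketch leaves softmax scaling at flex_attention's default (based on the halved sub-head dimension), which may differ from the paper's choice.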