Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

arXiv cs.LG / 4/22/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper addresses the limitation that standard bidirectional transformers are permutation-invariant unless explicit positional embeddings are added, unlike unidirectional attention, which encodes order through its causal triangular mask.
  • It proposes Dual Triangle Attention, which splits each attention head’s query-key subspace in two and applies complementary triangular masks, so that one half attends to past-and-self positions and the other to future-and-self positions, preserving bidirectional context while retaining the causal mask’s implicit positional bias.
  • The method is implemented in PyTorch with flex_attention as a single compiled kernel call and adds no learned parameters beyond standard multi-head attention (a minimal sketch follows this list).
  • Experiments on a synthetic argmax position probe and on masked language modeling for both natural language and protein sequences show that Dual Triangle Attention learns positional information without explicit positional embeddings and performs strongly when combined with RoPE.
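
The mask pattern itself is easy to express. The snippet below is a minimal sketch, not the paper's reference implementation: it assumes the per-head query-key split can be illustrated by giving one half of the heads a past-and-self (lower-triangular) mask and the other half a future-and-self (upper-triangular) mask, written as a single mask_mod so that one compiled flex_attention call covers both triangles.

```python
# Minimal sketch under the assumptions above; not the authors' code.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 128, 64  # illustrative batch, heads, sequence length, head dim

def dual_triangle_mask(b, h, q_idx, kv_idx):
    # First half of heads: past-and-self (lower triangle, as in causal attention).
    # Second half of heads: future-and-self (upper triangle, the mirror image).
    past_and_self = kv_idx <= q_idx
    future_and_self = kv_idx >= q_idx
    return ((h < H // 2) & past_and_self) | ((h >= H // 2) & future_and_self)

# The mask depends on the head index, so H must be given; B=None broadcasts over batch.
# Assumes a CUDA device is available.
block_mask = create_block_mask(dual_triangle_mask, B=None, H=H, Q_LEN=S, KV_LEN=S, device="cuda")

q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))

# One compiled kernel call; no learned parameters beyond standard multi-head attention.
out = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)
```

Under this layout, every position still receives bidirectional context once the head outputs are combined, while each triangle keeps the order information that a causal mask provides for free.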

Abstract

Bidirectional transformers are the foundation of many sequence modeling tasks across natural, biological, and chemical language domains, but they are permutation-invariant without explicit positional embeddings. In contrast, unidirectional attention inherently encodes positional information through its triangular mask, enabling models to operate without positional embeddings altogether. Here, we introduce Dual Triangle Attention, a novel bidirectional attention mechanism that separates the query-key subspace of each attention head into two complementary triangular masks: one that attends to past-and-self positions and one that attends to future-and-self positions. This design provides bidirectional context while maintaining the causal mask's implicit positional inductive bias in both directions. Using PyTorch's flex_attention, Dual Triangle Attention is implemented as a single compiled kernel call with no additional parameters beyond standard multi-head attention. We evaluated Dual Triangle Attention across three settings: (1) a synthetic argmax position probe, (2) masked language modeling (MLM) on natural language, and (3) MLM on protein sequences. In the argmax task, both Dual Triangle Attention and causal attention learn positional information without explicit positional embeddings, whereas standard bidirectional attention cannot. In the MLM experiments, Dual Triangle Attention with Rotary Positional Embeddings (RoPE) achieved the best context extension performance and strong performance across the board. These findings suggest that Dual Triangle Attention is a viable attention mechanism for bidirectional transformers, with or without positional embeddings.
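
As intuition for why the argmax probe separates these attention variants, here is a hypothetical version of such a task (the paper's exact setup may differ): the label is the index of the largest token, so a model whose outputs are invariant to input permutations has no way to recover it.

```python
# Hypothetical argmax position probe; illustrative only, details are assumptions.
import torch

def make_argmax_batch(batch_size: int, seq_len: int, vocab_size: int = 100):
    # Random token sequences; the target is the position of the largest token id,
    # which cannot be predicted from the multiset of tokens alone.
    x = torch.randint(0, vocab_size, (batch_size, seq_len))
    y = x.argmax(dim=-1)
    return x, y
```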