Is Attention sink without Positional Encoding unavoidable? [D]
Reddit r/MachineLearning / 4/30/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research

TL;DR: As soon as I remove positional encoding (PE) from self- or cross-attention, I start seeing vertical hot lines in attention heatmaps. Is there any way to get query-conditioned attention without PE?

I've been trying to pre-train two kinds of small, tinkering-level Transformer-based models: an encoder-decoder model, and a cross-attention-memory-only model (basically removing the FFNs and using cross-attended vectors as memory banks instead). Every time I train the cross-attention, I see vertical lines like those in the attached image, which I'm guessing means every query vector is attending to the same key tokens. This happens whenever I don't use RoPE or any other PE in cross-attention. I do start to see some diagonals when I add PE, though I don't think cross-attention should need it, since the queries and keys are representations of different data. The same pattern shows up in plain causal self-attention as soon as I remove PE.

My question: how do I force the model to attend to key tokens dynamically, based on the query token? I've already tried regularizing the attention to be more spread out, which does spread it out, but still in vertical lines; no diagonals or any other pattern.
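For intuition, here is a minimal, self-contained sketch (PyTorch; this is not the OP's code, and the shared direction u and the sink index 3 are fabricated for illustration) of how content-only attention can produce exactly this vertical-stripe pattern: when every query shares a dominant direction and one key aligns with it, each row of the softmax concentrates on the same column, with no positional information involved.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_q, n_k, d = 8, 8, 16

# Shared unit direction that all queries lean along (an assumption of this
# sketch, standing in for whatever dominant component training produces).
u = torch.randn(d)
u = u / u.norm()

Q = torch.randn(n_q, d) + 3.0 * u  # every query has a component along u
K = torch.randn(n_k, d)
K[3] = 4.0 * u                     # hypothetical "sink" key aligned with u

# Content-only scaled dot-product attention: no positional encoding anywhere.
attn = F.softmax(Q @ K.T / d**0.5, dim=-1)  # shape (n_q, n_k)

print(attn.round(decimals=2))  # column 3 dominates nearly every row
print("mean mass on sink column:", attn[:, 3].mean().item())
```

Most rows put the bulk of their mass on column 3 regardless of query content, which is the "vertical hot line" the post describes: the stripe reflects the learned geometry of queries and keys, not missing position indices per se.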
Key Points
- The post reports that removing positional encoding from self- or cross-attention produces “vertical hot lines” in attention heatmaps, i.e. every query putting its mass on the same few key tokens.
- The author observes the same pathology even in causal self-attention without positional encoding, indicating the issue may be structural rather than limited to cross-attention.
- Adding positional encoding (e.g., RoPE) yields more diagonal-like patterns, implying PE helps break the symmetry across queries.
- The core question asks whether query-conditioned, token-specific attention can be achieved without positional encoding, and what architectural or training changes could prevent attention collapse.
- Attempts at regularization to spread attention out do not eliminate the vertical-line failure mode (see the sketch after this list), motivating further investigation into the underlying cause.
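The post doesn't say which regularizer was tried, but a common choice is an entropy bonus on each query's attention row. Below is a hedged sketch (PyTorch; the tensor layout and the weight lam are assumptions, not details from the post). It also suggests why such a penalty spreads mass without fixing the collapse: it is applied row by row, so it pushes every row toward uniform but contains no term that rewards rows for differing from one another, matching the observation that the heatmap spreads out yet stays in vertical lines.

```python
import torch

def attention_entropy_penalty(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean negative entropy of each query's attention row.

    `attn` is assumed to hold softmax outputs of shape
    (batch, heads, n_queries, n_keys), with rows summing to 1.
    """
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (batch, heads, n_q)
    return -entropy.mean()  # minimizing this maximizes per-row entropy

# Hypothetical usage: add to the task loss with a small weight lam.
# loss = task_loss + lam * attention_entropy_penalty(attn)
```

Since the penalty treats each row independently, breaking the vertical lines would need a signal that differentiates queries from one another, which is exactly what the post is asking how to obtain without PE.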