We've been exploring an alternative framing of positional encoding: instead of additively injecting position signals into token embeddings, treat position as a geometric constraint on the manifold the embeddings are allowed to occupy.
The core idea:
- Standard additive PE shifts embeddings in ways that can interfere with semantic geometry
- Treating position as a manifold constraint instead preserves the semantic neighborhood structure
- This gives a cleaner separation between "what this token means" and "where this token sits"
- Preliminary results show more stable attention patterns on longer sequences without explicit length generalization tricks
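The post doesn't spell out the construction, but one familiar concrete instance of the additive-vs-geometric contrast in the bullets above is sinusoidal PE vs a RoPE-style rotation: adding a position vector changes an embedding's norm and direction, while a position-dependent rotation is an isometry, so the embedding's magnitude (one crude proxy for "what this token means") is untouched. A minimal NumPy sketch, illustrative only and not the method described in the post:

```python
import numpy as np

def additive_pe(x, pos, d):
    # standard sinusoidal PE, added directly to the embedding
    i = np.arange(d // 2)
    freqs = 1.0 / (10000 ** (2 * i / d))
    pe = np.empty(d)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return x + pe

def rotary_pe(x, pos, d):
    # rotate each 2-D coordinate pair by a position-dependent angle;
    # a rotation is norm-preserving, so position only changes orientation
    i = np.arange(d // 2)
    theta = pos / (10000 ** (2 * i / d))
    cos, sin = np.cos(theta), np.sin(theta)
    x2 = x.reshape(-1, 2)
    out = np.stack([x2[:, 0] * cos - x2[:, 1] * sin,
                    x2[:, 0] * sin + x2[:, 1] * cos], axis=-1)
    return out.reshape(d)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)

print(np.linalg.norm(x))                     # original norm
print(np.linalg.norm(additive_pe(x, 5, d)))  # norm shifts with position
print(np.linalg.norm(rotary_pe(x, 5, d)))    # norm unchanged
```

The point of the toy: under the additive map, position leaks into the same coordinates that carry semantics, while under the rotational map the semantic norm is invariant by construction. "Position as a manifold constraint" as described above presumably generalizes this kind of invariance beyond simple rotations.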
The practical upshot seems to be better out-of-distribution length handling and less attention-sink behavior, though we're still stress-testing the latter claim.
We're genuinely not sure yet whether this is a principled geometric reframing or just another way to regularize positional influence. Curious whether this decomposition feels natural to people working on interpretability or long-context architectures.
arXiv link once we clean up the writeup.
