Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding
arXiv cs.CV / 3/12/2026
News · Models & Research
Key Points
- The paper identifies that the distance-based inductive bias of Multimodal RoPE degrades inter-modal attention as the text sequence length increases, causing visual fading in long-context generation.
- It proposes inter-modal Distance Invariant Position Encoding (DIPE), which disentangles position encoding by modality to preserve intra-modal locality while anchoring inter-modal proximity.
- DIPE, when combined with Multimodal RoPE, mitigates the inter-modal distance penalty and keeps visual signals perceptually grounded across long contexts.
- Experimental results show preserved performance on short-context benchmarks alongside significantly improved long-context visual grounding, with code available at the linked GitHub repository.
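The core idea above can be illustrated with a toy sketch: under standard sequential (Multimodal-RoPE-style) indexing, the relative distance between a late text query and the visual tokens grows with text length, so RoPE's decay suppresses cross-modal attention; a distance-invariant scheme keeps intra-modal relative distances but pins inter-modal pairs to a fixed offset. This is a simplified one-dimensional illustration, not the paper's actual formulation; the function names and the constant anchor value are assumptions.

```python
import numpy as np

def rel_distance(q_pos, k_pos, q_is_text, k_is_text, anchor=1.0):
    # Intra-modal pairs keep the ordinary relative distance, preserving
    # locality within each modality; inter-modal pairs use a fixed
    # anchor offset, so the distance no longer grows with text length.
    # (The constant-anchor rule is an illustrative assumption.)
    if q_is_text == k_is_text:
        return q_pos - k_pos
    return anchor

def rope_decay_proxy(rel, dim=64, base=10000.0):
    # Mean cosine of the RoPE rotation at relative distance `rel`:
    # a rough proxy for how the encoding attenuates attention scores.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return float(np.mean(np.cos(rel * inv_freq)))

# Query = last text token; key = a visual token at position 0.
for n_txt in (16, 4096):
    seq_rel = n_txt                                   # grows with context
    inv_rel = rel_distance(n_txt, 0, True, False)     # stays fixed
    print(n_txt, rope_decay_proxy(seq_rel), rope_decay_proxy(inv_rel))
```

With sequential indexing the text-to-visual relative distance scales with `n_txt`, while the invariant scheme evaluates every inter-modal pair at the same offset regardless of how long the text grows, which is the behavior the paper's "anchoring inter-modal proximity" claim describes.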