AI Navigate

Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

arXiv cs.CV / 3/12/2026

📰 News · Models & Research

Key Points

  • The paper identifies that the distance-based inductive bias of Multimodal RoPE degrades inter-modal attention as the text sequence length increases, causing visual fading in long-context generation.
  • It proposes inter-modal Distance Invariant Position Encoding (DIPE), which disentangles position encoding by modality to preserve intra-modal locality while anchoring inter-modal proximity.
  • DIPE, when combined with Multimodal RoPE, mitigates the inter-modal distance penalty and keeps visual signals perceptually grounded across long contexts.
  • Experimental results show preserved performance on short-context benchmarks alongside significantly improved long-context visual grounding, with code available at the linked GitHub repository.

Abstract

Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.
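The mechanism described in the abstract can be illustrated with a toy sketch. This is our reading, not the paper's actual implementation (see the linked repository for that): assume DIPE keeps the standard RoPE relative offset `i - j` for same-modality token pairs, and replaces every cross-modal offset with a fixed anchored distance so vision–text attention no longer decays with sequence length. The `ANCHOR` constant and the modality encoding below are hypothetical choices for illustration.

```python
import numpy as np

ANCHOR = 1  # hypothetical fixed inter-modal distance (a tunable choice)

def relative_positions(modalities):
    """Toy relative-distance matrix for a DIPE-style position encoding.

    Intra-modal pairs keep the usual RoPE offset (i - j); inter-modal
    pairs are clamped to a constant anchor, so attention between visual
    and text tokens is not penalized as the sequence grows.
    """
    n = len(modalities)
    pos = np.arange(n)
    rel = pos[:, None] - pos[None, :]                # standard relative offsets
    cross = modalities[:, None] != modalities[None, :]  # inter-modal mask
    # Preserve direction (sign) but pin the magnitude of cross-modal offsets.
    return np.where(cross, np.sign(rel) * ANCHOR, rel)

# 3 visual tokens (0) followed by 4 text tokens (1).
mods = np.array([0] * 3 + [1] * 4)
R = relative_positions(mods)
print(R[5, 0])  # text -> image: anchored offset, regardless of distance
print(R[5, 4])  # text -> text: plain relative distance as in standard RoPE
```

In this sketch the text token at position 5 sees every image token at the same anchored offset, while its offsets to other text tokens still grow with distance, matching the abstract's split between anchored inter-modal proximity and preserved intra-modal locality.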