VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

arXiv cs.RO / 4/30/2026


Key Points

  • The paper introduces VLN-Cache, a training-free token caching method for vision-and-language navigation (VLN) models to reduce inference cost for real-time use.
  • It argues that prior caching approaches break down in VLN because both visual dynamics (viewpoint changes move token positions) and semantic dynamics (token relevance changes across navigation stages) make cached tokens misaligned or stale.
  • VLN-Cache addresses these issues with view-aligned remapping to restore geometric correspondences and a task-relevance saliency filter that prevents reuse at semantic transition points.
  • It also uses a layer-adaptive entropy policy to manage a per-layer reuse budget, improving the trade-off between speed and accuracy.
  • On the R2R-CE simulation benchmark, VLN-Cache achieves up to 1.52x faster inference while keeping competitive navigation success rates.
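The caching decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `remap_tokens`, `reuse_mask`, the per-token flow field, and both thresholds are hypothetical names and values chosen for clarity. The idea it demonstrates is that a cached token is reused only if (a) it still matches the current content after its grid position is realigned to the new viewpoint, and (b) its task-relevance saliency has not spiked, which serves as the semantic-transition veto.

```python
import numpy as np

def remap_tokens(cache, flow):
    """View-aligned remapping (illustrative): shift each cached token's
    grid position by an estimated integer per-token flow (dy, dx)."""
    h, w, _ = cache.shape
    remapped = np.zeros_like(cache)
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y, x]
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w:
                remapped[y, x] = cache[sy, sx]
    return remapped

def reuse_mask(curr, remapped, saliency, sim_thresh=0.95, sal_thresh=0.5):
    """Reuse a cached token only if it is cosine-similar to the aligned
    current content AND its task-relevance saliency stays low."""
    num = (curr * remapped).sum(-1)
    denom = (np.linalg.norm(curr, axis=-1)
             * np.linalg.norm(remapped, axis=-1) + 1e-8)
    cos = num / denom
    return (cos > sim_thresh) & (saliency < sal_thresh)
```

With zero flow and an unchanged frame, every token passes the similarity test, so the mask is governed entirely by the saliency veto; under camera motion, the remapping step is what keeps position-wise comparison from pairing misaligned content.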

Abstract

Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and a fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shifts displace token positions across frames, causing position-wise matching to pair misaligned content; and (2) semantic dynamics, where token relevance shifts as navigation progresses through task stages, making cached states stale. We propose VLN-Cache, a visual- and semantic-dynamics-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to a 1.52x speedup while maintaining competitive navigation success rates.
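The layer-adaptive entropy policy mentioned in the abstract can be illustrated with a small sketch. The inverse-entropy weighting below is an assumption made for illustration (the paper's exact allocation rule is not given here): layers whose attention distributions are low-entropy, and hence more stable, are granted a larger share of the total token-reuse budget.

```python
import numpy as np

def layer_budgets(attn_entropies, total_budget):
    """Illustrative layer-adaptive policy: split a total reuse budget
    across layers in proportion to inverse attention entropy, so that
    stable (low-entropy) layers reuse more cached tokens."""
    e = np.asarray(attn_entropies, dtype=float)
    weights = 1.0 / (e + 1e-8)        # low entropy -> high weight
    weights = weights / weights.sum() # normalize to a distribution
    return np.round(weights * total_budget).astype(int)
```

For example, with per-layer entropies of 1.0, 2.0, and 4.0 and a total budget of 70 reused tokens, this rule assigns the largest share to the most stable first layer, matching the stated goal of trading speed against accuracy on a per-layer basis.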