LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation

arXiv cs.RO / 4/22/2026


Key Points

  • The paper argues that real-world vision-language navigation still suffers from visible stop-and-go behavior because the sense–inference–execution loop is blocking, forcing the controller to wait for sensing, transmission, and inference before moving again.
  • It introduces LiveVLN, a training-free runtime framework that augments pretrained VLM navigators with multi-step action continuation so the system can keep actions available while newly arrived observations are being processed.
  • LiveVLN overlaps execution with the processing of fresh observations, handing off refreshed future actions before the currently executable action prefix is exhausted, which reduces idle waiting.
  • Experiments on R2R and RxR show that the method preserves benchmark performance while reducing waiting time and improving action availability.
  • In deployment-oriented evaluations on StreamVLN and NaVIDA, LiveVLN reduces average episode waiting time by up to 77.7% and shortens wall-clock episode time by 12.6%–19.6%.

Abstract

Recent navigation systems achieve strong benchmark results, yet real-world deployment often remains visibly stop-and-go. This bottleneck arises because the sense-inference-execution loop is still blocking: after each new observation, the controller must wait for sensing, transmission, and inference before motion can continue. Reducing action-generation cost alone therefore does not remove redundant waiting. To address this issue, we present LiveVLN, a training-free framework for more continuous embodied navigation by augmenting pretrained VLM navigators with multi-step action continuation. Instead of pausing for each full sense-and-inference round, LiveVLN overlaps execution with the processing of newly arrived observations, allowing refreshed future actions to be handed off before the current executable prefix is exhausted. This design keeps actions continuously available during motion, reducing idle waiting and enabling smoother online execution. The framework operates at runtime and can be integrated with compatible pretrained VLM navigators. Across R2R and RxR, LiveVLN preserves benchmark performance while reducing waiting time and improving action availability. In real-world deployments, it cuts average episode waiting time by up to 77.7% and shortens wall-clock episode time by 12.6% on StreamVLN and 19.6% on NaVIDA, yielding more coherent execution during deployment. Code is available at https://github.com/NIneeeeeem/LiveVLN.
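The core runtime idea — keep executing a previously planned action prefix while inference on the newest observation runs in the background, then hand off the refreshed actions before the prefix runs out — can be sketched as a simple threaded pattern. This is a minimal illustration, not the paper's implementation: the `LiveExecutor` class, the `navigator.plan(obs)` interface, and the `horizon` parameter are all hypothetical names chosen for this sketch.

```python
import queue
import threading


class LiveExecutor:
    """Sketch of non-blocking, LiveVLN-style execution (hypothetical API).

    The executor consumes a queue of previously planned actions while a
    background thread runs model inference on the newest observation.
    Refreshed future actions replace the stale tail of the queue, so
    motion can continue instead of idling during inference.
    """

    def __init__(self, navigator, horizon=4):
        self.navigator = navigator  # assumed to expose .plan(obs) -> list of actions
        self.horizon = horizon      # max number of future actions kept executable
        self.actions = queue.Queue()

    def _infer(self, obs):
        # Runs in the background; execution of queued actions continues meanwhile.
        new_actions = self.navigator.plan(obs)  # multi-step action continuation
        # Swap the stale tail for the refreshed future actions.
        while not self.actions.empty():
            try:
                self.actions.get_nowait()
            except queue.Empty:
                break
        for a in new_actions[: self.horizon]:
            self.actions.put(a)

    def step(self, obs, execute):
        # Kick off inference on the fresh observation without blocking motion.
        worker = threading.Thread(target=self._infer, args=(obs,))
        worker.start()
        executed = []
        # Keep executing whatever actions are currently available.
        while worker.is_alive() or not self.actions.empty():
            try:
                action = self.actions.get(timeout=0.01)
                execute(action)
                executed.append(action)
            except queue.Empty:
                pass  # no action ready yet; inference is still running
        worker.join()
        return executed
```

In a real deployment the hand-off would also need to reconcile the refreshed plan with actions already executed (the prefix), and inference would run on an accelerator rather than a Python thread; the sketch only shows the overlap structure that removes blocking waits.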