CoFL: Continuous Flow Fields for Language-Conditioned Navigation

arXiv cs.RO · April 30, 2026


Key Points

  • The paper introduces CoFL, an end-to-end policy for language-conditioned navigation that outputs a continuous flow field from BEV observations and a language instruction.
  • Instead of predicting trajectories from a single start point, CoFL learns local motion vectors at arbitrary BEV locations, using each scene-instruction annotation as dense spatial supervision.
  • The approach generates trajectories from any starting position by numerically integrating the predicted flow field, supporting simple real-time rollouts and closed-loop recovery.
  • To scale training and evaluation, the authors build a dataset of 500k+ BEV image–instruction pairs with procedurally generated flow fields and trajectories derived from semantic maps from Matterport3D and ScanNet.
  • Experiments on strictly unseen scenes show CoFL outperforms modular vision-language planners and trajectory-generation policies in both precision and safety, and it also performs zero-shot in real-world tests with feasible closed-loop control.
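The rollout mechanism described above — integrating a predicted flow field from an arbitrary start point — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the grid resolution, step size, unit-speed normalization, and stopping rule are all assumptions.

```python
import numpy as np

def sample_flow(flow, pos):
    """Bilinearly interpolate a (H, W, 2) BEV flow field at a continuous (x, y) point."""
    h, w, _ = flow.shape
    x = np.clip(pos[0], 0.0, w - 1 - 1e-6)
    y = np.clip(pos[1], 0.0, h - 1 - 1e-6)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * flow[y0, x0] + fx * flow[y0, x1]
    bot = (1 - fx) * flow[y1, x0] + fx * flow[y1, x1]
    return (1 - fy) * top + fy * bot

def rollout(flow, start, step=0.5, n_steps=200):
    """Generate a trajectory from any start point by forward-Euler integration
    of the predicted flow field (assumed stopping rule: near-zero flow = goal)."""
    pos = np.asarray(start, dtype=float)
    traj = [pos.copy()]
    for _ in range(n_steps):
        v = sample_flow(flow, pos)
        norm = np.linalg.norm(v)
        if norm < 1e-6:                    # near-zero flow: treat as goal reached
            break
        pos = pos + step * v / norm        # unit-speed step along the local flow direction
        traj.append(pos.copy())
    return np.array(traj)

# Toy field: uniform flow toward +x; any start point rolls out rightward.
flow = np.zeros((64, 64, 2))
flow[..., 0] = 1.0
traj = rollout(flow, start=(5.0, 32.0), n_steps=50)
```

Because the field is defined everywhere in the workspace, the same integration loop also gives closed-loop recovery for free: if the robot drifts off its nominal path, re-querying the field at the perturbed position yields a fresh rollout with no replanning.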

Abstract

Existing language-conditioned navigation systems typically rely on modular pipelines or trajectory generators, but the latter use each scene–instruction annotation mainly to supervise one start-conditioned rollout. To address these limitations, we present CoFL, an end-to-end policy that maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. CoFL reformulates navigation as workspace-conditioned field learning rather than start-conditioned trajectory prediction: it learns local motion vectors at arbitrary BEV locations, turning each scene–instruction annotation into dense spatial control supervision. Trajectories are generated from any start by numerical integration of the predicted field, enabling simple real-time rollout and closed-loop recovery. To enable large-scale training and evaluation, we build a dataset of over 500k BEV image–instruction pairs, each procedurally annotated with a flow field and a trajectory derived from semantic maps built on Matterport3D and ScanNet. Evaluating on strictly unseen scenes, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and trajectory generation policies in both navigation precision and safety, while maintaining real-time inference. Finally, we deploy CoFL zero-shot in real-world experiments with BEV observations across multiple layouts, maintaining feasible closed-loop control and a high success rate.
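The "dense spatial control supervision" the abstract describes can be made concrete with a small sketch: instead of penalizing error along a single annotated trajectory, the loss samples arbitrary BEV locations and compares predicted motion vectors against the annotated flow field. This is a hypothetical training objective under assumed shapes (`predict_fn`, the sampling count, and the MSE form are not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_flow_loss(predict_fn, gt_flow, n_points=256):
    """Mean-squared error between predicted and annotated motion vectors at
    randomly sampled BEV grid locations (hypothetical dense-supervision loss)."""
    h, w, _ = gt_flow.shape
    ys = rng.integers(0, h, n_points)
    xs = rng.integers(0, w, n_points)
    pred = predict_fn(xs, ys)        # (n_points, 2) predicted local motion vectors
    target = gt_flow[ys, xs]         # (n_points, 2) procedurally annotated flow
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

# Sanity check: a predictor that reproduces the ground-truth field has zero loss.
gt = rng.normal(size=(32, 32, 2))
loss = dense_flow_loss(lambda xs, ys: gt[ys, xs], gt)
```

The contrast with start-conditioned trajectory prediction is that every annotated scene–instruction pair supervises the policy at many workspace points, not just along one rollout, which is what makes the 500k procedurally annotated fields usable as dense labels.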