Multimodal embodiment-aware navigation transformer

arXiv cs.RO / 4/22/2026


Key Points

  • The paper proposes ViLiNT, a multimodal, embodiment-aware navigation transformer for goal-conditioned ground-robot navigation that targets robustness under distribution shifts affecting the environment, robot, or sensors.
  • ViLiNT fuses RGB images, 3D LiDAR point clouds, a goal embedding, and an embodiment descriptor in a transformer, using the transformer output to condition a diffusion model that generates candidate navigable trajectories.
  • It adds an offline-trained path-clearance prediction head that scores and ranks the diffusion-generated trajectories, improving collision avoidance by selecting safer paths.
  • The robot’s embodiment token is used in both diffusion conditioning and trajectory ranking, enabling generated and selected trajectories to respect the robot’s physical dimensions.
  • Experiments in three simulated environments show a 166% average Success Rate improvement over the vision-only baseline NoMaD, and real-world rover deployments in obstacle fields further confirm the approach’s robustness.
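The clearance-based ranking step in the key points above can be sketched as follows. This is a toy illustration, not the paper's implementation: the learned clearance head is replaced by a hypothetical geometric score (worst-case waypoint distance to a toy obstacle map, shrunk by a robot radius taken from the embodiment descriptor), and the diffusion sampler is replaced by random candidate paths.

```python
import numpy as np

rng = np.random.default_rng(0)

def clearance_scores(trajectories, embodiment):
    # Hypothetical stand-in for the paper's learned clearance head:
    # score = minimum waypoint distance to the nearest obstacle, reduced
    # by the robot's radius (read from the embodiment descriptor).
    obstacles = np.array([[2.0, 0.5], [3.5, -1.0]])  # toy obstacle map
    # distance from every waypoint of every trajectory to every obstacle
    d = np.linalg.norm(trajectories[:, :, None, :] - obstacles[None, None], axis=-1)
    return d.min(axis=(1, 2)) - embodiment["radius"]  # worst-case clearance per path

def rank_candidates(trajectories, embodiment):
    # Pick the candidate with the largest predicted clearance,
    # mirroring the select-the-safest-path idea described above.
    scores = clearance_scores(trajectories, embodiment)
    return trajectories[np.argmax(scores)], scores

# 8 random candidate paths stand in for diffusion samples: 5 x-y waypoints each
candidates = rng.normal(loc=[[1.0, 0.0]], scale=0.8, size=(8, 5, 2)).cumsum(axis=1)
best_path, scores = rank_candidates(candidates, {"radius": 0.3})
```

Because the embodiment radius enters the score, a wider robot penalizes tight gaps more heavily, which is the intuition behind making the ranking embodiment-aware.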

Abstract

Goal-conditioned navigation models for ground robots trained using supervised learning show promising zero-shot transfer, but their collision-avoidance capability nevertheless degrades under distribution shift, i.e. environmental, robot, or sensor configuration changes. We propose ViLiNT, a multimodal, attention-based policy for goal navigation, trained on heterogeneous data from multiple platforms and environments, which improves robustness with two key features. First, we fuse RGB images, 3D LiDAR point clouds, a goal embedding, and a robot's embodiment descriptor with a transformer architecture to capture complementary geometry and appearance cues. The transformer's output is used to condition a diffusion model that generates navigable trajectories. Second, using automatically generated offline labels, we train a path-clearance prediction head for scoring and ranking trajectories produced by the diffusion model. Both the diffusion conditioning and the trajectory ranking head depend on a robot's embodiment token, which allows our model to generate and select trajectories with respect to the robot's dimensions. Across three simulated environments, ViLiNT improves Success Rate on average by 166% over an equivalent state-of-the-art vision-only baseline (NoMaD). This increase in performance is confirmed through real-world deployments of a rover navigating in obstacle fields. These results highlight that combining multimodal fusion with our collision prediction mechanism leads to improved off-road navigation robustness.
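The multimodal fusion described in the abstract can be sketched as a single token sequence passed through self-attention. Everything here is an assumption for illustration: the per-modality encoders are replaced by random projections, the token counts and embedding width are arbitrary, and a one-head attention layer stands in for the full transformer; the only point is that image, LiDAR, goal, and embodiment tokens share one sequence whose pooled output could condition a trajectory generator.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # toy embedding width (assumed, not from the paper)

# Hypothetical encoder outputs, one row per token.
img_tokens = rng.normal(size=(4, D))        # e.g. 4 RGB patch embeddings
lidar_tokens = rng.normal(size=(4, D))      # e.g. 4 LiDAR feature embeddings
goal_token = rng.normal(size=(1, D))        # goal embedding
embodiment_token = rng.normal(size=(1, D))  # encodes the robot's dimensions

tokens = np.concatenate([img_tokens, lidar_tokens, goal_token, embodiment_token])

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention, as in a standard transformer layer.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = q @ k.T / np.sqrt(x.shape[1])
    a = np.exp(a - a.max(axis=1, keepdims=True))  # numerically stable softmax
    a /= a.sum(axis=1, keepdims=True)
    return a @ v

Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
fused = self_attention(tokens, Wq, Wk, Wv)

# Mean-pooled context vector that would condition the diffusion trajectory model.
cond = fused.mean(axis=0)
```

Because the embodiment token attends to (and is attended by) every other token, the fused conditioning signal carries the robot's dimensions into trajectory generation, matching the embodiment-aware conditioning the abstract describes.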