CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics
arXiv cs.RO / 4/1/2026
Key Points
- The paper introduces CLaD, a robotics planning framework that explicitly aligns kinematic (proprioceptive) and semantic state transitions rather than planning in only one space.
- CLaD uses asymmetric cross-attention where kinematic transitions query semantic ones, enabling “grounded latent foresight” predictions conditioned on both modalities.
- It trains with self-supervised objectives, EMA target encoders, and auxiliary reconstruction losses to reduce representation collapse while keeping predictions anchored to observable states.
- The predicted foresights are combined with current observations to condition a diffusion policy that generates actions.
- On the LIBERO-LONG benchmark, CLaD reports a 94.7% success rate, remaining competitive with large vision-language-action models while using fewer parameters.
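
The asymmetric cross-attention and EMA target encoder mentioned in the key points can be illustrated with a small NumPy sketch. This is not the paper's implementation: the token counts, the embedding width, the single attention head, the projection matrices `w_q`, `w_k`, `w_v`, and the decay value are all assumptions chosen for illustration. It only shows the direction of the attention (kinematic tokens as queries, semantic tokens as keys and values) and the usual form of an exponential-moving-average parameter update.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(kin, sem, w_q, w_k, w_v):
    """Kinematic tokens query semantic tokens (one direction only).

    kin: (Tk, d) kinematic transition tokens, used as queries.
    sem: (Ts, d) semantic transition tokens, used as keys and values.
    All shapes and weights here are hypothetical, not taken from the paper.
    """
    q = kin @ w_q
    k = sem @ w_k
    v = sem @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (Tk, Ts)
    weights = softmax(scores, axis=-1)        # each query row sums to 1
    return weights @ v, weights

def ema_update(target, online, decay=0.99):
    # Standard EMA target update: target <- decay * target + (1 - decay) * online.
    # The decay value is an assumption.
    return {name: decay * target[name] + (1.0 - decay) * online[name]
            for name in target}

rng = np.random.default_rng(0)
d = 8
kin = rng.normal(size=(4, d))   # 4 kinematic tokens (assumed)
sem = rng.normal(size=(6, d))   # 6 semantic tokens (assumed)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = asymmetric_cross_attention(kin, sem, w_q, w_k, w_v)

online = {"w": np.ones((2, 2))}
target = {"w": np.zeros((2, 2))}
target = ema_update(target, online, decay=0.9)
```

Because only the kinematic side issues queries, the output has one fused vector per kinematic token, each a weighted mix of semantic values. The EMA target moves slowly toward the online parameters, which is the common way such targets are kept stable in self-supervised training.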