UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
arXiv cs.RO / 4/10/2026
Key Points
- The paper proposes UniLACT, a depth-aware transformer-based vision-language-action (VLA) model that improves latent action pretraining by incorporating 3D geometric structure instead of relying on RGB appearance alone.
- It introduces UniLARN, a unified latent action learning framework that uses inverse and forward dynamics objectives to learn a shared embedding space for RGB and depth while explicitly modeling cross-modal interactions (see the sketch after this list).
- The learned modality-specific and unified latent action representations are used as pseudo-labels to enable depth-aware pretraining, giving downstream policies stronger spatial priors for contact-rich manipulation.
- Experiments in both simulation and real-world settings show that UniLACT outperforms RGB-only latent action baselines across in-domain and out-of-domain pretraining regimes, on both seen and unseen tasks.
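
The paper summary does not spell out UniLARN's actual architecture, so the following is only a minimal PyTorch sketch of the general inverse/forward-dynamics recipe the second bullet describes: encode consecutive RGB-depth frame pairs into a shared space, infer a latent action with an inverse dynamics head, and ground it with a forward dynamics prediction. The class name `LatentActionSketch`, all layer sizes, and the simple averaging fusion are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionSketch(nn.Module):
    """Hypothetical sketch of latent action learning over RGB + depth.

    Inverse dynamics infers a latent action from frames (t, t+1);
    forward dynamics predicts the t+1 embedding from (t, action).
    Names and dimensions are assumptions, not UniLARN's design.
    """

    def __init__(self, feat_dim: int = 256, act_dim: int = 32):
        super().__init__()
        # Modality encoders mapping flattened frames into a shared
        # embedding space (stand-ins for real visual backbones).
        self.rgb_enc = nn.Sequential(nn.LazyLinear(feat_dim), nn.GELU())
        self.depth_enc = nn.Sequential(nn.LazyLinear(feat_dim), nn.GELU())
        # Inverse dynamics head: consecutive embeddings -> latent action.
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, act_dim),
        )
        # Forward dynamics head: (embedding, latent action) -> next embedding.
        self.forward_dyn = nn.Sequential(
            nn.Linear(feat_dim + act_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def encode(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Average the two modalities in the shared space; the paper
        # models cross-modal interactions explicitly, which this skips.
        return 0.5 * (self.rgb_enc(rgb) + self.depth_enc(depth))

    def forward(self, rgb_t, depth_t, rgb_t1, depth_t1):
        z_t = self.encode(rgb_t, depth_t)
        z_t1 = self.encode(rgb_t1, depth_t1)
        # Inverse dynamics yields the latent action (the pseudo-label).
        a_hat = self.inverse(torch.cat([z_t, z_t1], dim=-1))
        # Forward dynamics grounds that action in predicted dynamics.
        z_t1_pred = self.forward_dyn(torch.cat([z_t, a_hat], dim=-1))
        fwd_loss = F.mse_loss(z_t1_pred, z_t1.detach())
        return a_hat, fwd_loss


# Illustrative usage on random flattened 64x64 frames.
model = LatentActionSketch()
rgb_t, rgb_t1 = torch.randn(8, 3 * 64 * 64), torch.randn(8, 3 * 64 * 64)
dep_t, dep_t1 = torch.randn(8, 1 * 64 * 64), torch.randn(8, 1 * 64 * 64)
a_hat, fwd_loss = model(rgb_t, dep_t, rgb_t1, dep_t1)
```

In a pipeline like the one the third bullet describes, the inferred `a_hat` would then serve as a pseudo-action label for pretraining the downstream VLA policy, so the policy inherits the depth-aware spatial structure baked into the latent actions.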