UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

arXiv cs.RO / 4/10/2026


Key Points

  • The paper proposes UniLACT, a depth-aware transformer-based vision-language-action (VLA) model that improves latent action pretraining by incorporating 3D geometric structure instead of relying on RGB appearance alone.
  • It introduces UniLARN, a unified latent action learning framework that uses inverse and forward dynamics objectives to learn a shared embedding space for RGB and depth while explicitly modeling cross-modal interactions.
  • The learned modality-specific and unified latent action representations are used as pseudo-labels to enable depth-aware pretraining, giving downstream policies stronger spatial priors for contact-rich manipulation.
  • Experiments in both simulation and real-world settings show that UniLACT outperforms RGB-based latent action baselines across in-domain and out-of-domain pretraining, including both seen and unseen tasks.

Abstract

Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks. The project page is at https://manishgovind.github.io/unilact-vla/
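To make the core idea concrete, here is a minimal, hypothetical sketch of latent action learning with inverse and forward dynamics objectives over RGB and depth features, fused into a unified latent. This is not the paper's implementation: all module names, dimensions, and the fusion design are illustrative assumptions, and the paper's actual architecture is transformer-based.

```python
# Hedged sketch (NOT the paper's code): inverse dynamics infers a latent
# action from consecutive frame features; forward dynamics predicts the
# next frame's features from the current features plus that latent.
# RGB and depth branches are fused into a unified latent action that
# could serve as a pseudo-label for policy pretraining.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionLearner(nn.Module):
    """Single-modality latent action model (illustrative)."""
    def __init__(self, feat_dim=256, latent_dim=32):
        super().__init__()
        # Inverse dynamics: (f_t, f_{t+1}) -> latent action z_t
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, latent_dim),
        )
        # Forward dynamics: (f_t, z_t) -> predicted f_{t+1}
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + latent_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, f_t, f_next):
        z = self.inverse(torch.cat([f_t, f_next], dim=-1))
        f_next_pred = self.forward_model(torch.cat([f_t, z], dim=-1))
        loss = F.mse_loss(f_next_pred, f_next)  # forward-dynamics loss
        return z, loss

class UnifiedLatentAction(nn.Module):
    """Fuses RGB and depth latents into one unified latent (assumed design)."""
    def __init__(self, feat_dim=256, latent_dim=32):
        super().__init__()
        self.rgb = LatentActionLearner(feat_dim, latent_dim)
        self.depth = LatentActionLearner(feat_dim, latent_dim)
        # A simple linear fusion stands in for the paper's explicit
        # cross-modal interaction modeling.
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, rgb_t, rgb_next, depth_t, depth_next):
        z_rgb, l_rgb = self.rgb(rgb_t, rgb_next)
        z_depth, l_depth = self.depth(depth_t, depth_next)
        z_unified = self.fuse(torch.cat([z_rgb, z_depth], dim=-1))
        return z_unified, l_rgb + l_depth
```

In this reading, the modality-specific latents (`z_rgb`, `z_depth`) and the unified latent (`z_unified`) would be quantized or otherwise discretized into pseudo-labels for depth-aware VLA pretraining; those downstream steps are omitted here.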