OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
arXiv cs.RO / 4/21/2026
Key Points
- The paper addresses the high latency of autoregressive Chain-of-Thought (CoT) reasoning in vision-language-action (VLA) models for autonomous driving by proposing a one-step latent alternative.
- It introduces OneVL, a unified VLA plus world model framework that compresses reasoning into compact latent tokens trained with dual auxiliary decoders.
- Unlike prior latent CoT approaches that rely mainly on linguistic representations, OneVL adds a visual world model decoder to predict future-frame tokens and embed causal road-and-agent dynamics into the latent space.
- A three-stage training pipeline progressively aligns the latent tokens with trajectory, language, and visual objectives for stable joint optimization; at inference the auxiliary decoders are discarded and planning runs in a single parallel forward pass (see the sketch after this list).
- On four benchmarks, OneVL is reported to be the first latent CoT method to outperform explicit CoT, achieving state-of-the-art accuracy at answer-only latency.
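A minimal PyTorch-style sketch of the training/inference split described above, assuming a generic transformer backbone. All names here (OneVLSketch, num_latent_tokens, the trajectory/language/world-model heads) are illustrative placeholders for "latent CoT tokens plus dual auxiliary decoders", not the paper's actual implementation or API.

```python
# Illustrative sketch only (not the authors' code): latent reasoning tokens
# supervised by two auxiliary decoders during training, discarded at inference.
import torch
import torch.nn as nn


class OneVLSketch(nn.Module):
    def __init__(self, d_model=512, num_latent_tokens=8,
                 vocab_size=32000, vision_token_dim=256, traj_dim=12):
        super().__init__()
        # Learned latent reasoning tokens appended to the input sequence;
        # they stand in for explicit autoregressive CoT text.
        self.latent_tokens = nn.Parameter(torch.randn(num_latent_tokens, d_model))
        # Stand-in for the shared vision-language backbone.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Trajectory head: used in both training and inference
        # (traj_dim = 12 here, e.g. 6 waypoints x 2 coordinates, illustrative).
        self.traj_head = nn.Linear(d_model, traj_dim)
        # Auxiliary decoders, used only during training:
        #  - language head supervises the latents with CoT text tokens,
        #  - world-model head predicts future-frame visual tokens.
        self.lang_head = nn.Linear(d_model, vocab_size)
        self.world_head = nn.Linear(d_model, vision_token_dim)

    def forward(self, obs_embeds, training=True):
        # obs_embeds: (B, T, d_model) fused vision-language features.
        B = obs_embeds.size(0)
        latents = self.latent_tokens.unsqueeze(0).expand(B, -1, -1)
        # One parallel pass over observations + latent tokens (no step-by-step decoding).
        h = self.backbone(torch.cat([obs_embeds, latents], dim=1))
        h_lat = h[:, -latents.size(1):]            # hidden states of the latent tokens
        traj = self.traj_head(h_lat.mean(dim=1))   # planned trajectory
        if not training:
            return traj                            # answer-only latency at inference
        return traj, self.lang_head(h_lat), self.world_head(h_lat)
```

Because the two auxiliary heads only read the latent hidden states during training, dropping them at inference removes their cost entirely, which is how a latent-CoT design of this kind keeps planning at answer-only latency.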