HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
arXiv cs.CV / 4/10/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- HY-Embodied-0.5 is introduced as a foundation model family tailored for real-world embodied agents, focusing on spatial/temporal visual perception and embodied reasoning for prediction, interaction, and planning.
- The suite includes two variants—an efficient 2B-activated model for edge deployment and a 32B-activated model for more complex reasoning—aimed at balancing capability and practicality.
- A Mixture-of-Transformers (MoT) architecture with modality-specific computation and latent tokens strengthens the fine-grained visual representations that embodied tasks require (see the first sketch after this list).
- The models' reasoning is improved via an iterative, self-evolving post-training approach, and on-policy distillation transfers large-model capabilities to the smaller variant (see the second sketch after this list).
- Across a 22-task benchmark suite, the 2B model beats similarly sized baselines on 16 benchmarks, while the 32B variant reaches performance comparable to frontier systems. The authors also report real-world robot control gains from a Vision-Language-Action (VLA) model trained on their VLM foundation; code and models are open-sourced.
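
The paper itself is the authority on the MoT design; as a rough illustration of the general idea, here is a minimal PyTorch sketch in which self-attention runs over the joint token sequence while each modality routes through its own feed-forward weights. All names, dimensions, and the two-way text/vision routing are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """One Mixture-of-Transformers-style block: shared self-attention over
    the joint sequence, but modality-specific feed-forward weights.
    Illustrative sketch only; not HY-Embodied-0.5's real architecture."""

    def __init__(self, dim: int = 512, heads: int = 8, n_modalities: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN per modality (e.g., 0 = text tokens, 1 = vision/latent tokens).
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_modalities)
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Attention mixes information across all modalities jointly.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Each token is then processed by the FFN matching its modality.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, ffn in enumerate(self.ffns):
            mask = modality_ids == m        # (batch, seq) boolean mask
            if mask.any():
                out[mask] = ffn(h[mask])
        return x + out

# Usage: a batch of 16 text tokens followed by 8 learnable latent tokens.
block = MoTBlock()
x = torch.randn(2, 24, 512)
modality_ids = torch.cat([torch.zeros(2, 16, dtype=torch.long),
                          torch.ones(2, 8, dtype=torch.long)], dim=1)
print(block(x, modality_ids).shape)  # torch.Size([2, 24, 512])
```

The design point this illustrates: modality-specific parameters let each stream (text, vision, latents) develop specialized representations while the shared attention still fuses them at every layer.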
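
On-policy distillation, as generically understood, trains the student on its own rollouts with the teacher's token distribution as the target. The sketch below assumes that framing; the `generate` and `logits` interfaces are hypothetical placeholders, not the authors' API, and the paper's actual recipe may differ.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompts, optimizer, max_new=32):
    """One on-policy distillation step (generic sketch): the student samples
    its own continuations, then minimizes reverse KL against the teacher's
    token distribution on exactly those samples. `generate` and `logits`
    are placeholder model methods assumed for illustration."""
    student.eval()
    with torch.no_grad():
        # 1) Sample from the *student* policy -- this is what makes it on-policy.
        seqs = student.generate(prompts, max_new_tokens=max_new)
        # 2) The teacher provides target distributions over those same tokens.
        teacher_logits = teacher.logits(seqs)
    student.train()
    student_logits = student.logits(seqs)
    # 3) Per-token reverse KL(student || teacher), averaged over the batch.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training on student-sampled sequences, rather than a fixed teacher-generated corpus, means the teacher corrects the student in the states the student actually visits, which is the usual motivation for on-policy over offline distillation.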