HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

arXiv cs.CV / 4/10/2026


Key Points

  • HY-Embodied-0.5 is introduced as a foundation model family tailored for real-world embodied agents, focusing on spatial/temporal visual perception and embodied reasoning for prediction, interaction, and planning.
  • The suite includes two variants—an efficient 2B-activated model for edge deployment and a 32B-activated model for more complex reasoning—aimed at balancing capability and practicality.
  • A Mixture-of-Transformers (MoT) architecture with modality-specific computation and latent tokens is used to strengthen fine-grained visual representations needed for embodied tasks.
  • The models’ reasoning is improved via an iterative, self-evolving post-training approach, and on-policy distillation transfers large-model capabilities to the smaller variant.
  • Across 22 benchmarks, the 2B model beats similarly sized baselines on 16, while the 32B variant reaches performance comparable to frontier systems; the authors also report real-world robot control gains from a Vision-Language-Action (VLA) model trained on their VLM foundation. Code and models are open-sourced.
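To make the Mixture-of-Transformers idea concrete, the following is a minimal sketch of modality-specific computation: tokens from all modalities attend to each other in a joint sequence, but each token's feed-forward pass is routed through weights dedicated to its modality. This is an illustrative toy, not the paper's implementation; all names, shapes, and the choice to share the attention projections are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                 # hidden size (toy value)
n_modalities = 2      # e.g. 0 = text, 1 = vision

# Modality-specific feed-forward weights (the MoT idea: separate
# non-attention parameters per modality); attention projections are
# shared here purely to keep the sketch short.
W_ffn = rng.normal(size=(n_modalities, d, d)) / np.sqrt(d)
W_qkv = rng.normal(size=(d, 3 * d)) / np.sqrt(d)

def mot_layer(x, modality_ids):
    """One simplified MoT block: joint self-attention over the whole
    sequence, then a feed-forward pass routed by each token's modality."""
    # --- shared self-attention over the joint multimodal sequence ---
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    h = attn @ v
    # --- modality-specific feed-forward (the "mixture") ---
    out = np.empty_like(h)
    for m in range(n_modalities):
        mask = modality_ids == m
        out[mask] = np.maximum(h[mask] @ W_ffn[m], 0.0)  # ReLU FFN
    return out

tokens = rng.normal(size=(5, d))             # 5 tokens in one sequence
modality_ids = np.array([0, 0, 1, 1, 1])     # text, text, vision, ...
y = mot_layer(tokens, modality_ids)
print(y.shape)  # (5, 8)
```

The point of the routing loop is that vision and text tokens still exchange information through the shared attention, while their per-token transformations use disjoint parameters, which is the mechanism the paper credits for stronger fine-grained visual representations.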

Abstract

We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computation. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
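The on-policy distillation mentioned above can be illustrated with a toy sketch: the student samples from its *own* distribution, the teacher scores those samples, and the student follows a REINFORCE-style gradient of the reverse KL toward the teacher. This is a hedged illustration under simplifying assumptions (single-token "policies" over a tiny vocabulary, no language model involved); it is not the paper's training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 6

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(p, q):
    """KL(p || q) for dense distributions p, q."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy stand-ins: in the paper these would be the 32B teacher and the
# 2B student; here each "policy" is just a logit vector over 6 tokens.
teacher_p = softmax(rng.normal(size=vocab))
student_logits = np.zeros(vocab)  # student starts uniform

kl_init = reverse_kl(softmax(student_logits), teacher_p)

lr, batch = 0.5, 64
for _ in range(300):
    p_s = softmax(student_logits)
    # On-policy: draw samples from the *student's* own distribution...
    toks = rng.choice(vocab, size=batch, p=p_s)
    # ...and ascend the teacher's advantage on those samples, which is
    # the REINFORCE estimator of the reverse-KL gradient KL(student||teacher).
    grad = np.zeros(vocab)
    for t in toks:
        w = np.log(teacher_p[t]) - np.log(p_s[t])   # teacher advantage
        grad += w * (np.eye(vocab)[t] - p_s)        # d log p_s / d logits
    student_logits += lr * grad / batch

kl_final = reverse_kl(softmax(student_logits), teacher_p)
print(round(kl_init, 3), round(kl_final, 3))
```

Training on the student's own samples (rather than on teacher-generated text) is what makes the distillation "on-policy": the small model is corrected exactly where its own behavior diverges from the large model's.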