E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Tweedie Discrete Diffusion

arXiv cs.RO / 3/26/2026


Key Points

  • The paper introduces E0, a Tweedie discrete diffusion framework for Vision-Language-Action (VLA) models that generates robot actions as iterative denoising over quantized action tokens.
  • It argues that prior VLA generalization and action-quality issues stem from action distribution structure, token-based symbolic reasoning in VLM/VLA backbones, and practical finite control resolution.
  • E0 is designed to improve fine-grained yet executable action control and to mitigate distribution mismatch issues seen in masking-based discrete diffusion approaches.
  • The method also adds spherical viewpoint perturbation augmentation to improve robustness to camera viewpoint changes without collecting additional data.
  • Experiments across LIBERO, VLABench, ManiSkill, and a real-world Franka arm report state-of-the-art results in 14 environments, with an average 10.7% gain over strong baselines.
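The paper does not spell out its tokenizer or denoising schedule here, so the following is only a minimal sketch of the general idea: actions are quantized into discrete tokens, and generation proceeds by repeatedly estimating the clean tokens from the model's output distribution (a Tweedie-style posterior estimate) and partially re-noising. The bin count `K`, the uniform quantization, and the keep-probability schedule `alpha` are all assumptions for illustration, not the paper's actual design.

```python
import numpy as np

K = 256  # assumed number of quantization bins per action dimension

def quantize(action, low=-1.0, high=1.0, k=K):
    """Map a continuous action in [low, high] to a discrete token id."""
    idx = np.floor((action - low) / (high - low) * k).astype(int)
    return np.clip(idx, 0, k - 1)

def dequantize(token, low=-1.0, high=1.0, k=K):
    """Map a token id back to the center of its bin."""
    return low + (token + 0.5) * (high - low) / k

def tweedie_denoise_step(logits, noisy_tokens, alpha, rng=None):
    """One illustrative denoising step: form the model's posterior over
    clean tokens from `logits` (a Tweedie-style clean-sample estimate),
    take its argmax, then keep each predicted token with probability
    `alpha` and otherwise retain the current noisy token."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    x0_hat = probs.argmax(-1)  # point estimate of the clean tokens
    keep = rng.random(x0_hat.shape) < alpha
    return np.where(keep, x0_hat, noisy_tokens)
```

Running this step with `alpha` annealed from 0 toward 1 over the sampling loop yields a coarse-to-fine refinement of the token sequence; decoding the final tokens with `dequantize` gives executable continuous actions at the chosen resolution.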

Abstract

Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. However, existing VLA systems still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We argue that these limitations are closely tied to the structural properties of actions in VLA settings, including the inherent multi-peaked nature of action distributions, the token-based symbolic reasoning of pretrained VLM/VLA backbones, and the effective finite resolution imposed by real-world robotic control. Motivated by these properties, we introduce E0, a Tweedie discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. By operating in a discrete action space with a principled diffusion process, E0 naturally aligns with token-based reasoning, supports fine-grained yet executable action control, and avoids the distributional mismatch of masking-based discrete diffusion. We further introduce a spherical viewpoint perturbation augmentation to enhance robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, ManiSkill, and a real-world Franka arm demonstrate that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
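The abstract describes the spherical viewpoint perturbation only at a high level. One plausible reading, sketched below under assumptions not stated in the paper, is to jitter the camera position on the sphere centred at the scene target (preserving the camera-to-target distance) by a small random rotation, here implemented with Rodrigues' rotation formula and a hypothetical `max_angle_deg` bound.

```python
import numpy as np

def perturb_viewpoint(cam_pos, target, max_angle_deg=10.0, rng=None):
    """Rotate the camera position about `target` by a random angle of at
    most `max_angle_deg`, keeping the camera-to-target distance fixed.
    Illustrative sketch only; the paper's exact scheme is not given."""
    rng = np.random.default_rng() if rng is None else rng
    v = cam_pos - target
    r = np.linalg.norm(v)
    # Random rotation axis and bounded rotation angle.
    k = rng.normal(size=3)
    k /= np.linalg.norm(k)
    angle = np.radians(rng.uniform(0.0, max_angle_deg))
    # Rodrigues' rotation formula.
    v_rot = (v * np.cos(angle)
             + np.cross(k, v) * np.sin(angle)
             + k * np.dot(k, v) * (1.0 - np.cos(angle)))
    # Rescale to guard against numerical drift in the radius.
    return target + v_rot * (r / np.linalg.norm(v_rot))
```

Applied during training, such a perturbation (together with a consistent re-rendering or reprojection of the observation) would expose the policy to nearby viewpoints without collecting new data, which matches the robustness motivation stated above.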