From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

arXiv cs.RO / 4/24/2026

Key Points

The paper addresses a key embodied-intelligence challenge: aligning high-level semantic intent with low-level physical control despite spatiotemporal scale mismatch.
It argues that current generative vision-language-action (VLA) policies, which follow a “generation-from-noise” paradigm, ignore this mismatch, leading to inefficient representations and weak condition alignment during optimization.
It introduces ResVLA, shifting to a “refinement-from-intent” paradigm by using spectral analysis to split control into a deterministic low-frequency anchor (intent) and a stochastic high-frequency residual (local dynamics).
The method anchors generation on predicted intent and uses a residual diffusion bridge to refine local behavior, improving training efficiency.
Simulation experiments show competitive performance, robustness to language and robot-embodiment perturbations, and faster convergence than standard generative baselines; real-world robot experiments also show strong performance.
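
The low-frequency/high-frequency split described in the key points can be illustrated with a small sketch. The summary does not say which transform or cutoff ResVLA uses, so the real FFT, the function name `split_trajectory`, and the `keep` parameter below are assumptions chosen for illustration, not the paper's implementation.

```python
import numpy as np

def split_trajectory(actions, keep):
    """Split an action chunk of shape (T, D) into a smooth anchor and a residual.

    Hypothetical helper: keeps the lowest `keep` frequency bins per action
    dimension as the anchor; everything else becomes the residual.
    """
    actions = np.asarray(actions, dtype=float)
    spectrum = np.fft.rfft(actions, axis=0)      # per-dimension spectrum over time
    low = np.zeros_like(spectrum)
    low[:keep] = spectrum[:keep]                 # low-pass: discard higher bins
    anchor = np.fft.irfft(low, n=actions.shape[0], axis=0)
    residual = actions - anchor                  # high-frequency remainder
    return anchor, residual

# Toy chunk: a slow motion (1 cycle) plus a small fast wiggle (20 cycles).
T = 64
t = np.arange(T) / T
slow = np.sin(2 * np.pi * t)
fast = 0.1 * np.sin(2 * np.pi * 20 * t)
chunk = np.stack([slow + fast, np.cos(2 * np.pi * t)], axis=1)

anchor, residual = split_trajectory(chunk, keep=4)
```

In this toy case the anchor recovers the slow motion and the residual holds only the fast wiggle, and the two always sum back to the original chunk. In a setup like the one described, a policy would predict the anchor deterministically and leave only the residual to the generative model.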

Abstract

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
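
To make the contrast between starting from noise and starting from an anchor concrete, here is a minimal sketch of a generic Brownian-bridge interpolant between a predicted anchor and a target action chunk. The abstract does not give the form of ResVLA's residual diffusion bridge, so this standard bridge marginal, the name `bridge_sample`, and the `sigma` parameter are assumptions for illustration only.

```python
import numpy as np

def bridge_sample(anchor, target, t, sigma=1.0, rng=None):
    """Sample the Brownian-bridge marginal at time t in [0, 1].

    Hypothetical helper: the sample equals `anchor` at t=0 and `target` at t=1,
    with Gaussian noise of variance sigma^2 * t * (1 - t) in between.
    """
    rng = np.random.default_rng() if rng is None else rng
    anchor = np.asarray(anchor, dtype=float)
    target = np.asarray(target, dtype=float)
    noise = rng.standard_normal(anchor.shape)
    mean = (1.0 - t) * anchor + t * target       # straight line between endpoints
    std = sigma * np.sqrt(t * (1.0 - t))         # zero at both endpoints
    return mean + std * noise

anchor = np.zeros((8, 2))    # stand-in for a predicted low-frequency intent
target = np.ones((8, 2))     # stand-in for the ground-truth action chunk
midpoint = bridge_sample(anchor, target, t=0.5, sigma=0.0)
```

Unlike a generation-from-noise process, whose starting point is pure Gaussian noise, this path begins at the anchor, so a model trained along it only needs to account for the gap between anchor and target.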