CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

arXiv cs.RO / 4/24/2026

Key Points

The paper introduces CorridorVLA, a Vision-Language-Action (VLA) approach that injects spatial guidance explicitly by predicting sparse spatial anchors as incremental physical changes.
It uses these anchors to define a “corridor” (a tolerance region) in the training objective so action trajectories outside the allowed spatial evolution receive corrective gradients.
The method is designed to permit small deviations caused by contact variability and execution noise while still enforcing alignment with physically plausible spatial changes.
On the LIBERO-Plus benchmark, CorridorVLA shows consistent improvements across SmolVLA and GR00T, boosting success rates by 3.4%–12.4%, with GR00T-Corr reaching 83.21%.
The results suggest that interpretable, action-aligned physical constraints can complement spatial information that is otherwise implicitly encoded in visual or latent representations.
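
To make the corridor idea concrete, here is a minimal NumPy sketch of a tolerance-region penalty. It is an illustration under stated assumptions, not the paper's implementation: the function name `corridor_penalty`, the Euclidean distance, the squared hinge, and the `tolerance` parameter are all hypothetical choices that the summary does not specify.

```python
import numpy as np

def corridor_penalty(pred_deltas, anchor_deltas, tolerance):
    """Hinge-style penalty: zero inside the corridor, quadratic outside.

    pred_deltas:   (K, 3) position changes implied by the generated actions
    anchor_deltas: (K, 3) predicted sparse anchors (delta-positions)
    tolerance:     assumed corridor half-width, same units as the deltas
    """
    pred_deltas = np.asarray(pred_deltas, dtype=float)
    anchor_deltas = np.asarray(anchor_deltas, dtype=float)
    # Distance of each implied spatial change from its anchor.
    dist = np.linalg.norm(pred_deltas - anchor_deltas, axis=-1)
    # Only the part of the distance that exceeds the tolerance is penalized.
    excess = np.maximum(dist - tolerance, 0.0)
    return float(np.mean(excess ** 2))

# Two hypothetical anchors (metres), one trajectory inside and one outside.
anchors = np.array([[0.10, 0.00, 0.00], [0.10, 0.05, 0.00]])
inside = anchors + np.array([[0.01, 0.0, 0.0], [0.0, -0.01, 0.0]])
outside = anchors + np.array([[0.05, 0.0, 0.0], [0.0, 0.0, 0.0]])

print(corridor_penalty(inside, anchors, tolerance=0.02))   # zero: within the corridor
print(corridor_penalty(outside, anchors, tolerance=0.02))  # positive: outside the corridor
```

Because the penalty is flat inside the tolerance band, an autodiff framework would produce no gradient for small deviations and a corrective gradient only once a trajectory leaves the corridor, which matches the behavior described above.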

Abstract

Vision-Language-Action (VLA) models often use intermediate representations to connect multimodal inputs with continuous control, yet spatial guidance is often injected implicitly through latent features. We propose CorridorVLA, which predicts sparse spatial anchors as incremental physical changes (e.g., Δ-positions) and uses them to impose an explicit tolerance region in the training objective for action generation. The anchors define a corridor that guides a flow-matching action head: trajectories whose implied spatial evolution falls outside it receive corrective gradients, while minor deviations from contacts and execution noise are permitted. On the more challenging LIBERO-Plus benchmark, CorridorVLA yields consistent gains across both SmolVLA and GR00T, improving success rate by 3.4%–12.4% over the corresponding baselines; notably, our GR00T-Corr variant reaches a success rate of 83.21%. These results indicate that action-aligned physical cues can provide direct and interpretable constraints for generative action policies, complementing spatial guidance encoded in visual or latent forms. Code is available at https://github.com/corridorVLA.
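
The abstract does not state the training objective itself. One plausible way to write a flow-matching loss with a corridor term is sketched below. Every symbol here is an assumption introduced for illustration: $\mathcal{L}_{\mathrm{FM}}$ for the flow-matching loss, $\lambda$ for a weighting coefficient, $K$ for the number of anchors, $\Delta\hat{p}_k$ for the spatial change implied by the generated actions at anchor $k$, $\Delta p_k^{\mathrm{anchor}}$ for the predicted anchor, and $\epsilon$ for the corridor half-width.

```latex
\[
\mathcal{L}
  = \mathcal{L}_{\mathrm{FM}}
  + \lambda \, \frac{1}{K} \sum_{k=1}^{K}
    \max\!\bigl(0,\; \lVert \Delta\hat{p}_k - \Delta p_k^{\mathrm{anchor}} \rVert_2 - \epsilon \bigr)^{2}
\]
```

Under this form the second term vanishes whenever every implied change lies within $\epsilon$ of its anchor, so only trajectories that leave the corridor contribute a corrective gradient. The paper may use a different norm, weighting, or hinge shape.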
