Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding
arXiv cs.RO / 4/27/2026
Key Points
- Existing vision-language-action (VLA) models often entangle perception and control in a single pipeline, weakening language-conditioned grounding and causing failures in real-world tabletop settings, such as attempting to grasp targets that are not present and getting distracted by clutter.
- The paper introduces OBEYED-VLA, which disentangles perception grounding from action reasoning by adding object-centric, geometry-aware grounding over multi-view inputs before feeding a pretrained VLA policy.
- OBEYED-VLA uses a VLM-based stage to select task-relevant object regions across cameras, paired with a geometric grounding stage that prioritizes 3D structure over appearance (see the sketch after this list).
- The approach is fine-tuned on single-object demonstrations collected without clutter, and on a UR10e tabletop setup it significantly improves robustness across multiple hard regimes including distractors, absent-target rejection, background changes, and cluttered manipulation of unseen objects.
- Ablation results show that both semantic (object-centric) grounding and geometry-aware grounding are essential for the observed performance gains and better generalization.
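
The sketch below illustrates the kind of two-stage, perception-before-action pipeline the Key Points describe: an object-centric grounding step over multi-view inputs, a geometry-aware step that lifts the selected regions to 3D structure, and a pretrained VLA policy that acts on the grounded result. All names (`ObjectGrounder`, `GeometryGrounder`, `VLAPolicy`, `select_regions`, `lift_to_3d`) are hypothetical placeholders, not the paper's actual interfaces; only the overall structure mirrors the summary.

```python
# Minimal sketch, assuming a two-stage grounding front-end ahead of a VLA policy.
# All class and method names are hypothetical; they are not the paper's API.
from dataclasses import dataclass
import numpy as np


@dataclass
class Observation:
    rgb_views: list      # one HxWx3 image per camera
    depth_views: list    # one HxW depth map per camera
    instruction: str     # language command, e.g. "pick up the red mug"


class ObjectGrounder:
    """Stage 1 (hypothetical): VLM-style selection of task-relevant regions."""

    def select_regions(self, obs: Observation):
        # Placeholder: the paper's VLM stage would score regions against the
        # instruction in every view; here we return one dummy box per camera.
        return [np.array([0, 0, 64, 64]) for _ in obs.rgb_views]


class GeometryGrounder:
    """Stage 2 (hypothetical): prioritize 3D structure over appearance."""

    def lift_to_3d(self, obs: Observation, regions):
        # Placeholder: crop depth inside each selected region and collect the
        # crops as a crude geometry-centric representation.
        crops = []
        for depth, (x0, y0, x1, y1) in zip(obs.depth_views, regions):
            crops.append(depth[y0:y1, x0:x1])
        return crops


class VLAPolicy:
    """Pretrained VLA policy (hypothetical stub) consuming the grounded input."""

    def act(self, grounded_geometry, instruction: str) -> np.ndarray:
        # Placeholder 7-DoF action (e.g. end-effector delta plus gripper).
        return np.zeros(7)


def grounded_step(obs: Observation) -> np.ndarray:
    """Perception grounding is resolved before action reasoning begins."""
    regions = ObjectGrounder().select_regions(obs)           # object-centric
    geometry = GeometryGrounder().lift_to_3d(obs, regions)   # geometry-aware
    return VLAPolicy().act(geometry, obs.instruction)        # action reasoning


if __name__ == "__main__":
    obs = Observation(
        rgb_views=[np.zeros((128, 128, 3))] * 2,
        depth_views=[np.ones((128, 128))] * 2,
        instruction="place the green block in the bowl",
    )
    print(grounded_step(obs))  # zero action from the stub policy
```

The point of the structure, as the Key Points describe it, is that the policy never sees raw cluttered views: it only receives regions the grounding stages have already tied to the instruction and to 3D structure, which is what the ablations credit for the robustness gains.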