Stable Language Guidance for Vision-Language-Action Models
arXiv cs.RO / 4/21/2026
💬 Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- Vision-Language-Action (VLA) robotic models can fail under small linguistic changes due to a “modality collapse,” where strong visual priors drown out sparse language signals and the agent overfits to exact phrasing.
- The paper introduces Residual Semantic Steering (RSS), which probabilistically separates physical affordance from semantic execution to make actions follow intent rather than wording artifacts.
- RSS adds two components: Monte Carlo Syntactic Integration to approximate a better semantic posterior using LLM-driven distributional expansion, and Residual Affordance Steering to subtract visual affordance influence during decoding.
- Theoretical analysis claims RSS increases mutual information between action and intent while suppressing visual distractors, and experiments show state-of-the-art robustness on multiple manipulation benchmarks, including adversarial linguistic perturbations.
- The authors release the code for RSS on GitHub, enabling direct reproduction and further testing.
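The two components above can be sketched in a few lines. The code below is a minimal toy illustration, not the released implementation: `policy_logits`, `paraphrases`, and `rss_decode` are hypothetical names, and the model is a fake stand-in. It shows the shape of the idea — average action logits over LLM-generated rewordings (Monte Carlo Syntactic Integration), then subtract a scaled copy of the language-free, vision-only logits (Residual Affordance Steering) so phrasing artifacts and visual priors are both damped.

```python
import numpy as np

# Hypothetical stand-in for a VLA policy head: returns unnormalized
# action scores for one observation and one instruction string.
def policy_logits(obs, text):
    visual = obs  # strong visual affordance prior dominates
    s = sum(map(ord, text))  # deterministic fake language signal
    lang = 0.1 * np.array([s % 7, s % 5, s % 3])
    return visual + lang

def paraphrases(text, k):
    # Stand-in for LLM-driven distributional expansion of the instruction.
    return [f"{text} (variant {i})" for i in range(k)]

def rss_decode(obs, instruction, k=8, beta=0.5):
    """Sketch of Residual Semantic Steering (assumed structure).

    1) Monte Carlo Syntactic Integration: average logits over k
       paraphrases to approximate the semantic posterior, reducing
       sensitivity to exact phrasing.
    2) Residual Affordance Steering: subtract beta times the
       vision-only (empty-instruction) logits so decoding follows
       intent rather than the visual affordance prior alone.
    """
    semantic = np.mean(
        [policy_logits(obs, p) for p in paraphrases(instruction, k)],
        axis=0,
    )
    affordance = policy_logits(obs, "")  # language-free visual prior
    return semantic - beta * affordance

obs = np.array([2.0, 0.5, -1.0])
scores = rss_decode(obs, "pick up the red block")
action = int(np.argmax(scores))
```

The subtraction is the "residual" part: whatever the policy would do with no instruction at all is treated as the visual-affordance baseline, and only the language-driven residual steers the final choice.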