AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
arXiv cs.RO / 4/14/2026
Key Points
- The paper introduces AIM, an intent-aware unified world action model that addresses the mismatch between video-based world modeling (predicting how a scene evolves) and action generation (deciding where and how to interact, given an intent).
- AIM uses an explicit spatial interface by predicting an aligned spatial value map, routing future information to the action branch through value representations rather than decoding directly from future visuals.
- The method builds on pretrained video generation with a mixture-of-transformers shared architecture and employs intent-causal attention to isolate relevant future cues for action.
- It adds a self-distillation reinforcement learning stage that freezes the video and value branches, optimizing only the action head using dense rewards from projected value-map responses plus sparse task-level signals.
- On the RoboTwin 2.0 benchmark, AIM reportedly reaches a 94.0% average success rate and shows larger gains for long-horizon and contact-sensitive manipulation tasks, supported by a new 30K-trajectory simulation dataset with value-map annotations.
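The intent-causal attention described above can be illustrated with a toy attention mask. This is a hypothetical sketch, not AIM's actual masking scheme: it assumes a token layout of past video, future video, value map, and action tokens, and blocks action tokens from attending to raw future-video tokens, so future information reaches the action branch only through the value representations.

```python
import numpy as np

# Hypothetical token layout: [video_past | video_future | value_map | action].
# Action tokens may attend to past video and value-map tokens, but NOT to
# raw future-video tokens -- future cues flow only via the value branch.
def intent_causal_mask(n_past, n_future, n_value, n_action):
    n = n_past + n_future + n_value + n_action
    mask = np.ones((n, n), dtype=bool)   # True = attention allowed
    f0, f1 = n_past, n_past + n_future   # future-video token span
    a0 = n_past + n_future + n_value     # first action-token index
    mask[a0:, f0:f1] = False             # block action -> future-video attention
    return mask

mask = intent_causal_mask(n_past=4, n_future=4, n_value=2, n_action=2)
# Action rows see past-video and value-map columns, not future-video columns.
```

In a transformer, this boolean mask would be passed to the attention layer so disallowed positions receive negative-infinity logits before the softmax.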
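The self-distillation reinforcement-learning stage can be sketched as follows. All module names and shapes here are placeholders standing in for AIM's (unpublished) architecture: the video and value branches are frozen, only the action head is optimized, and the reward combines a dense term from the value-map response with a sparse task-level signal, under a simple REINFORCE-style update.

```python
import torch
import torch.nn as nn

# Stand-in modules; the real branches are transformer-based.
video_branch = nn.Linear(32, 32)   # placeholder for the video world model
value_branch = nn.Linear(32, 1)    # placeholder for the value-map predictor
action_head = nn.Linear(32, 7)     # placeholder action head (7-DoF action)

# Freeze the video and value branches; only the action head trains.
for p in list(video_branch.parameters()) + list(value_branch.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam(action_head.parameters(), lr=1e-4)

h = video_branch(torch.randn(8, 32))          # frozen visual features
dense_reward = value_branch(h).squeeze(-1)    # value-map response as dense reward
sparse_reward = torch.zeros(8)                # task-level success signal
reward = dense_reward + sparse_reward

# REINFORCE-style policy-gradient step on the action head only (illustrative).
mean = action_head(h)
dist = torch.distributions.Normal(mean, 1.0)
logp = dist.log_prob(dist.sample()).sum(-1)
loss = -(logp * reward.detach()).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

After `backward()`, gradients exist only on the action head; the frozen branches receive none, which is the point of this stage.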