XSkill: Continual Learning from Experience and Skills in Multimodal Agents
arXiv cs.AI / 3/13/2026
Key Points
- The authors identify two reusable knowledge streams needed for continual learning in multimodal agents without updating model parameters: experiences, which guide action-level tool selection and decisions, and skills, which guide task-level planning.
- XSkill grounds both streams in visual observations and uses visually grounded summarization and cross-rollout critique to distill experiences and skills during accumulation, then retrieves and adapts them during inference.
- Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently outperforms both tool-only and learning-based baselines and demonstrates superior zero-shot generalization.
- Analyses indicate that the two knowledge streams play complementary roles in shaping agents' reasoning behaviors and in improving generalization across domains.
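
To make the two-stream idea concrete, here is a minimal Python sketch of a parameter-free knowledge bank that stores distilled experiences and skills separately and retrieves from each at inference time. It is an illustration under stated assumptions, not the authors' implementation: the names `KnowledgeItem`, `KnowledgeBank`, and `build_context` are hypothetical, the token-overlap scorer stands in for whatever retriever XSkill actually uses, and the visual grounding, summarization, and cross-rollout critique steps are not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, field

KINDS = ("experience", "skill")


@dataclass
class KnowledgeItem:
    kind: str  # "experience" (action-level) or "skill" (task-level)
    text: str  # distilled note; plain text here, no visual grounding


def overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (assumed stand-in retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class KnowledgeBank:
    """Hypothetical store; knowledge accumulates here, model weights never change."""

    items: list[KnowledgeItem] = field(default_factory=list)

    def add(self, kind: str, text: str) -> None:
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind!r}")
        self.items.append(KnowledgeItem(kind, text))

    def retrieve(self, query: str, kind: str, k: int = 1) -> list[KnowledgeItem]:
        # Rank only items of the requested stream, most similar first.
        pool = [it for it in self.items if it.kind == kind]
        pool.sort(key=lambda it: overlap(query, it.text), reverse=True)
        return pool[:k]


def build_context(bank: KnowledgeBank, task: str, k: int = 1) -> str:
    """Assemble a prompt prefix: skills for planning, experiences for tool use."""
    lines = [f"Task: {task}", "Skills:"]
    lines += [f"- {it.text}" for it in bank.retrieve(task, "skill", k)]
    lines.append("Experiences:")
    lines += [f"- {it.text}" for it in bank.retrieve(task, "experience", k)]
    return "\n".join(lines)
```

In use, one would call `bank.add("skill", ...)` and `bank.add("experience", ...)` after each rollout, then prepend `build_context(bank, task)` to the agent's prompt for a new task. Keeping the two streams in separate pools means a task-level plan never competes with an action-level tip for the same retrieval slot.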




