AI Navigate

Information-Theoretic Constraints for Continual Vision-Language-Action Alignment

arXiv cs.CV · March 17, 2026

📰 News · Models & Research

Key Points

  • Info-VLA is an information-preserving continual learning framework for Vision-Language-Action models that aims to mitigate catastrophic forgetting by preserving cross-modal information structure.
  • It introduces Replay Anchor Contrastive Learning, which creates stable alignment anchors from a frozen teacher model to maintain cross-modal alignment in representation space.
  • It also employs Cross-Modal Mutual Information Maximization to preserve the dependency structure between visual and language representations via mutual information constraints.
  • The approach balances stability and plasticity to improve continual learning performance, demonstrated on the LIBERO benchmark with notable gains over existing methods in both retention and adaptation.
  • The results suggest that preserving historical alignment and cross-modal dependencies can lead to stronger continual learning for open-ended robotic VLA tasks.
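The Replay Anchor Contrastive Learning idea described above can be illustrated with a minimal sketch: the frozen teacher's embeddings serve as fixed positive anchors, and an InfoNCE-style loss pulls each student embedding toward its own anchor while pushing it away from the others. This is an illustrative reconstruction, not the paper's actual implementation; the function name, temperature value, and toy data are all assumptions.

```python
import numpy as np

def anchor_contrastive_loss(student_emb, teacher_anchors, temperature=0.1):
    """InfoNCE-style loss pulling each student embedding toward its
    frozen-teacher anchor (positive) and away from the other anchors
    (negatives). Both arrays have shape (N, D); rows are L2-normalized here.
    Hypothetical sketch -- not the authors' code."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_anchors / np.linalg.norm(teacher_anchors, axis=1, keepdims=True)
    logits = s @ t.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal: anchor i is the match for sample i.
    return -np.mean(np.diag(log_probs))

# Toy check: a student that matches its anchors incurs a lower loss
# than a randomly drifted one, i.e. the term penalizes alignment drift.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
loss_matched = anchor_contrastive_loss(anchors.copy(), anchors)
loss_random = anchor_contrastive_loss(rng.normal(size=(8, 16)), anchors)
print(loss_matched < loss_random)  # → True
```

Because the teacher is frozen, the anchors never move during continual adaptation, so this term acts as a stability constraint on the representation space while the task loss supplies plasticity.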

Abstract

When deployed in open-ended robotic environments, Vision-Language-Action (VLA) models need to continually acquire new skills, yet suffer from severe catastrophic forgetting. We observe that this degradation is related to the deterioration of cross-modal information structure, where dependencies among visual observations, language instructions, and actions progressively diffuse during continual adaptation. However, existing continual learning methods fail to preserve such cross-modal information dependencies. Thus, we propose Info-VLA, an information-preserving continual learning framework that maintains cross-modal information structure through two complementary constraints. Replay Anchor Contrastive Learning constructs stable alignment anchors from a frozen teacher model, preserving cross-modal alignment in the representation space. Cross-Modal Mutual Information Maximization further preserves the dependency structure between visual and language representations through mutual information constraints. By jointly preserving historical alignment and cross-modal dependency information, Info-VLA balances stability and plasticity during continual learning. Experiments on the LIBERO benchmark show that Info-VLA significantly outperforms existing methods in both task retention and adaptation.
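The second constraint, Cross-Modal Mutual Information Maximization, can be sketched via the standard InfoNCE lower bound on mutual information, I(V; L) ≥ log N − L_NCE, computed over a batch of paired visual and language embeddings. The bound rises when the two modalities share information and collapses toward zero when they are independent, which is the dependency structure the constraint is meant to preserve. This is a generic MI-estimation sketch under that assumption, not the paper's estimator; all names and values below are illustrative.

```python
import numpy as np

def infonce_mi_lower_bound(vis, lang, temperature=0.1):
    """InfoNCE lower bound on I(vis; lang): log(N) - L_NCE, computed
    from cosine similarities over a batch of N paired embeddings.
    Illustrative sketch -- not the authors' estimator."""
    v = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    l = lang / np.linalg.norm(lang, axis=1, keepdims=True)
    logits = v @ l.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce_loss = -np.mean(np.diag(log_probs))        # positives on the diagonal
    return np.log(len(vis)) - nce_loss

# Dependent pairs (language features correlated with vision) yield a
# higher MI bound than independent features, so maximizing this bound
# preserves cross-modal dependency during adaptation.
rng = np.random.default_rng(1)
vis = rng.normal(size=(64, 32))
paired = vis + 0.1 * rng.normal(size=(64, 32))   # shares information with vis
independent = rng.normal(size=(64, 32))          # no shared information
mi_paired = infonce_mi_lower_bound(vis, paired)
mi_indep = infonce_mi_lower_bound(vis, independent)
print(mi_paired > mi_indep)  # → True
```

Note the bound is capped at log N for a batch of size N, so in practice such estimators are computed over reasonably large batches; how Info-VLA weights this term against the task and anchor losses is not specified here.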