Information-Theoretic Constraints for Continual Vision-Language-Action Alignment
arXiv cs.CV · March 17, 2026
📰 News · Models & Research
Key Points
- Info-VLA is a continual learning framework for Vision-Language-Action (VLA) models that mitigates catastrophic forgetting by preserving the cross-modal information structure learned on earlier tasks.
- It introduces Replay Anchor Contrastive Learning, which derives stable anchors from a frozen teacher model and contrasts the current model's representations against them to maintain cross-modal alignment in representation space (a sketch follows the list).
- It also employs Cross-Modal Mutual Information Maximization, which constrains training to preserve the dependency structure between visual and language representations (see the second sketch below).
- The approach balances stability and plasticity; on the LIBERO benchmark it shows notable gains over existing methods in both retention of earlier tasks and adaptation to new ones.
- The results suggest that preserving historical alignment and cross-modal dependencies can lead to stronger continual learning for open-ended robotic VLA tasks.
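For intuition, here is a minimal PyTorch sketch of what a replay-anchor contrastive loss could look like: embeddings of replayed samples from the frozen teacher serve as fixed anchors, and an InfoNCE-style objective pulls the current model's embeddings toward their matching anchors. The function name, tensor shapes, and temperature are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def replay_anchor_contrastive_loss(student_emb, teacher_emb, temperature=0.07):
    """Hypothetical replay-anchor contrastive loss (not the paper's code).

    student_emb: (B, D) embeddings of replayed samples from the model being trained.
    teacher_emb: (B, D) anchor embeddings of the same samples from the frozen teacher.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1).detach()  # anchors are fixed: no gradient
    logits = s @ t.T / temperature                  # (B, B) cosine-similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Each student embedding should match its own teacher anchor (the diagonal);
    # the other anchors in the batch act as negatives.
    return F.cross_entropy(logits, targets)
```

Keeping the anchors detached is what makes them stable: the alignment target does not drift as the student adapts to new tasks.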
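The mutual-information constraint can likewise be approximated with a symmetric InfoNCE objective, a standard lower bound on the mutual information between paired views. The sketch below assumes paired (B, D) visual and language embeddings; all names are illustrative, and the actual estimator used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(vision_emb, lang_emb, temperature=0.07):
    """Symmetric InfoNCE bound: maximizing it preserves the statistical
    dependency between paired visual and language representations.

    vision_emb, lang_emb: (B, D) embeddings of paired image/text inputs.
    """
    v = F.normalize(vision_emb, dim=-1)
    l = F.normalize(lang_emb, dim=-1)
    logits = v @ l.T / temperature                  # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Average the vision-to-language and language-to-vision directions so
    # both modalities are constrained symmetrically.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```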
Related Articles
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA

OpenSeeker's open-source approach aims to break up the data monopoly for AI search agents
THE DECODER

How to Choose the Best AI Chat Models of 2026 for Your Business Needs
Dev.to

I built an AI that generates lesson plans in your exact teaching voice (open source)
Dev.to

6-Band Prompt Decomposition: The Complete Technical Guide
Dev.to