AI Navigate

Information-Theoretic Constraints for Continual Vision-Language-Action Alignment

arXiv cs.CV · March 17, 2026

📰 News · Models & Research

Key Points

  • Info-VLA is an information-preserving continual learning framework for Vision-Language-Action models that aims to mitigate catastrophic forgetting by preserving cross-modal information structure.
  • It introduces Replay Anchor Contrastive Learning, which creates stable alignment anchors from a frozen teacher model to maintain cross-modal alignment in representation space.
  • It also employs Cross-Modal Mutual Information Maximization to preserve the dependency structure between visual and language representations via mutual information constraints.
  • The approach balances stability and plasticity to improve continual learning performance, demonstrated on the LIBERO benchmark with notable gains over existing methods in both retention and adaptation.
  • The results suggest that preserving historical alignment and cross-modal dependencies can lead to stronger continual learning for open-ended robotic VLA tasks.
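The Replay Anchor Contrastive Learning idea described above can be illustrated with a minimal sketch: the frozen teacher's embeddings serve as fixed positive anchors, and an InfoNCE-style loss pulls each student embedding toward its own anchor while pushing it away from the others. This is an illustrative reconstruction, not the paper's actual implementation; the function name, temperature value, and toy data are all assumptions.

```python
import numpy as np

def anchor_contrastive_loss(student_emb, teacher_anchors, temperature=0.1):
    """InfoNCE-style loss pulling each student embedding toward its
    frozen-teacher anchor (positive) and away from the other anchors
    (negatives). Both arrays have shape (N, D); rows are L2-normalized here.
    Hypothetical sketch -- not the authors' code."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_anchors / np.linalg.norm(teacher_anchors, axis=1, keepdims=True)
    logits = s @ t.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives lie on the diagonal: anchor i is the match for sample i.
    return -np.mean(np.diag(log_probs))

# Toy check: a student that matches its anchors incurs a lower loss
# than a randomly drifted one, i.e. the term penalizes alignment drift.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
loss_matched = anchor_contrastive_loss(anchors.copy(), anchors)
loss_random = anchor_contrastive_loss(rng.normal(size=(8, 16)), anchors)
print(loss_matched < loss_random)  # → True
```

Because the teacher is frozen, the anchors never move during continual adaptation, so this term acts as a stability constraint on the representation space while the task loss supplies plasticity.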

Abstract

When deployed in open-ended robotic environments, Vision-Language-Action (VLA) models need to continually acquire new skills, yet suffer from severe catastrophic forgetting. We observe that this degradation is related to the deterioration of cross-modal information structure, where dependencies among visual observations, language instructions, and actions progressively diffuse during continual adaptation. However, existing continual learning methods fail to preserve such cross-modal information dependencies. Thus, we propose Info-VLA, an information-preserving continual learning framework that maintains cross-modal information structure through two complementary constraints. Replay Anchor Contrastive Learning constructs stable alignment anchors from a frozen teacher model, preserving cross-modal alignment in the representation space. Cross-Modal Mutual Information Maximization further preserves the dependency structure between visual and language representations through mutual information constraints. By jointly preserving historical alignment and cross-modal dependency information, Info-VLA balances stability and plasticity during continual learning. Experiments on the LIBERO benchmark show that Info-VLA significantly outperforms existing methods in both task retention and adaptation.
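The second constraint, Cross-Modal Mutual Information Maximization, can be sketched via the standard InfoNCE lower bound on mutual information, I(V; L) ≥ log N − L_NCE, computed over a batch of paired visual and language embeddings. The bound rises when the two modalities share information and collapses toward zero when they are independent, which is the dependency structure the constraint is meant to preserve. This is a generic MI-estimation sketch under that assumption, not the paper's estimator; all names and values below are illustrative.

```python
import numpy as np

def infonce_mi_lower_bound(vis, lang, temperature=0.1):
    """InfoNCE lower bound on I(vis; lang): log(N) - L_NCE, computed
    from cosine similarities over a batch of N paired embeddings.
    Illustrative sketch -- not the authors' estimator."""
    v = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    l = lang / np.linalg.norm(lang, axis=1, keepdims=True)
    logits = v @ l.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce_loss = -np.mean(np.diag(log_probs))        # positives on the diagonal
    return np.log(len(vis)) - nce_loss

# Dependent pairs (language features correlated with vision) yield a
# higher MI bound than independent features, so maximizing this bound
# preserves cross-modal dependency during adaptation.
rng = np.random.default_rng(1)
vis = rng.normal(size=(64, 32))
paired = vis + 0.1 * rng.normal(size=(64, 32))   # shares information with vis
independent = rng.normal(size=(64, 32))          # no shared information
mi_paired = infonce_mi_lower_bound(vis, paired)
mi_indep = infonce_mi_lower_bound(vis, independent)
print(mi_paired > mi_indep)  # → True
```

Note the bound is capped at log N for a batch of size N, so in practice such estimators are computed over reasonably large batches; how Info-VLA weights this term against the task and anchor losses is not specified here.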