AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
arXiv cs.CL / 4/17/2026
Key Points
- The paper argues that continual visual question answering (VQA) methods designed for symmetric, unimodal models fail for modern vision-language models (VLMs) because their trainable parts are inherently asymmetric.
- It explains that, because of this asymmetry, standard global regularization disproportionately protects the large language decoder while leaving the smaller but crucial visual projection layers exposed to interference, making them more prone to catastrophic forgetting.
- The proposed method, Asymmetric Information Masking (AIM), improves stability-plasticity trade-offs by applying modality-specific, targeted masks based on sensitivity to better protect vulnerable components.
- Experiments on VQA v2 and GQA in continual VQA settings show that AIM achieves state-of-the-art average performance, lower average forgetting, and better retention of compositional generalization to novel skill-concept combinations.
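The masking idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual algorithm: it assumes a Fisher-style sensitivity score (mean squared gradient per parameter) and a per-modality `keep_ratio` that freezes the most sensitive parameters, with the visual projection protected more aggressively than the language decoder.

```python
import numpy as np

def sensitivity(grads):
    # Fisher-style sensitivity: mean squared gradient per parameter,
    # averaged over a batch of gradient samples (an assumed proxy).
    return np.mean(np.stack(grads) ** 2, axis=0)

def make_mask(sens, keep_ratio):
    # mask == 0 freezes the most sensitive (task-critical) parameters;
    # keep_ratio is the fraction of parameters left trainable.
    k = int(len(sens) * keep_ratio)
    order = np.argsort(sens)          # ascending: least sensitive first
    mask = np.zeros_like(sens)
    mask[order[:k]] = 1.0
    return mask

# Asymmetric keep ratios (illustrative values): protect the visual
# projection far more than the large language decoder.
modality_keep = {"visual_proj": 0.2, "llm_decoder": 0.8}

rng = np.random.default_rng(0)
# Toy gradient samples: 4 batches of gradients over 8 parameters each.
grads = {name: [rng.normal(size=8) for _ in range(4)]
         for name in modality_keep}

masks = {name: make_mask(sensitivity(grads[name]), ratio)
         for name, ratio in modality_keep.items()}

# During training, the masked update grad * masks[name] would zero out
# updates to the protected (frozen) parameters of each component.
```

The asymmetry lives entirely in the `keep_ratio` values: the visual projection keeps only 20% of its parameters trainable here, while the decoder keeps 80%, which mirrors the paper's claim that the vulnerable visual side needs stronger protection than the over-parameterized language side.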

