DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

arXiv cs.CV / 4/10/2026


Key Points

  • The paper tackles lifelong knowledge editing for Vision Language Models (VLMs), highlighting how sequential edits can cause catastrophic forgetting, degraded reasoning, and cross-modal misalignment.
  • It argues that existing VLM editing methods still operate in entangled shared representation spaces and therefore suffer structural interference, even when they use gated adapters, activation edits, or parameter merging.
  • The proposed Dynamic Subspace Concept Alignment (DSCA) decomposes the representation space into orthogonal semantic subspaces (via incremental clustering and PCA) and performs edits only within these transformed spaces to structurally isolate concepts.
  • DSCA freezes the base model and uses a multi-term loss to preserve task fidelity, enforce edit locality, and maintain cross-modal alignment, yielding reported gains in single-edit success and long-sequence stability.
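To make the core idea concrete, here is a minimal numpy sketch of the subspace mechanism the key points describe: cluster joint representations, run PCA within each cluster to obtain an orthonormal basis per semantic subspace, and restrict an edit to the matched subspace. This is an illustration under assumed details (plain k-means standing in for the paper's incremental clustering, and made-up dimensions), not the authors' implementation.

```python
import numpy as np

def fit_subspaces(reps, n_clusters=3, dim=4, seed=0):
    """Cluster joint vision-language representations, then PCA within
    each cluster to get an orthonormal basis per semantic subspace.
    (Sketch: plain k-means stands in for incremental clustering.)"""
    rng = np.random.default_rng(seed)
    centroids = reps[rng.choice(len(reps), n_clusters, replace=False)]
    for _ in range(10):
        labels = np.argmin(
            np.linalg.norm(reps[:, None] - centroids[None], axis=-1), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = reps[labels == k].mean(axis=0)
    bases = {}
    for k in range(n_clusters):
        x = reps[labels == k] - centroids[k]
        # PCA via SVD: rows of vt are orthonormal principal directions
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        bases[k] = vt[:dim]
    return centroids, bases

def edit_in_subspace(rep, delta, centroids, bases):
    """Apply an edit only inside the matched concept subspace:
    project the raw update onto that basis, leaving the orthogonal
    complement (where other concepts live) untouched."""
    k = int(np.argmin(np.linalg.norm(centroids - rep, axis=1)))
    B = bases[k]                    # orthonormal rows spanning the subspace
    return rep + B.T @ (B @ delta)  # projected, structurally local edit
```

The point of the projection `B.T @ (B @ delta)` is that any component of the update orthogonal to the concept's subspace is discarded, which is the "isolation as an architectural property" claim in miniature.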

Abstract

Model editing aims to update a model's knowledge, adding new concepts and changing relevant information, without retraining. Lifelong editing is challenging because it is prone to disrupting previously learned concepts, especially in Vision-Language Models (VLMs), where sequential edits can degrade reasoning and cause cross-modal misalignment. Existing VLM knowledge-editing methods based on gated adapters, activation edits, and parameter merging mitigate the catastrophic forgetting seen in full fine-tuning; however, they still operate in the VLM's shared representation space, where concepts are entangled, so edits interfere with unrelated concepts. We hypothesize that this instability persists because current methods control edits algorithmically via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which mitigates this limitation by design: it decomposes the representation space into a set of orthogonal semantic subspaces and performs edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function that maintains task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98% single-edit success, remains above 95% after 1,000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction-tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention in continual lifelong editing across various datasets and benchmarks.
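The abstract names three loss terms but not their form. As a hedged illustration only, the combination could look like the following sketch, where the distance choices and weights `w` are assumptions, not the paper's actual objective:

```python
import numpy as np

def dsca_style_loss(edited, target, unrelated_pre, unrelated_post,
                    vision, language, w=(1.0, 1.0, 0.5)):
    """Illustrative three-term editing loss (weights w are assumed):
    fidelity  - the edited representation should match the edit target,
    locality  - representations of unrelated concepts should not move,
    alignment - vision and language embeddings should stay close."""
    fidelity = np.mean((edited - target) ** 2)
    locality = np.mean((unrelated_post - unrelated_pre) ** 2)
    cos = np.dot(vision, language) / (
        np.linalg.norm(vision) * np.linalg.norm(language))
    alignment = 1.0 - cos  # cosine distance between the two modalities
    return w[0] * fidelity + w[1] * locality + w[2] * alignment
```

A perfect edit that leaves unrelated concepts untouched and keeps the modalities aligned drives all three terms to zero, which matches the stability and locality behavior the abstract reports.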