CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
arXiv cs.LG / 3/19/2026
Key Points
- CARE introduces a covariance-aware, rank-enhanced decomposition for converting pretrained attention into Multi-Head Latent Attention (MLA), preserving KV-cache size while boosting expressivity.
- It combines activation-preserving factorization, adjusted-rank allocation, and KV-parity mapping, so the low-rank approximation is aligned with actual activations and capacity is allocated where it is most needed (a minimal sketch follows this list).
- Evaluation on Qwen3-4B and Llama-3.1 shows up to a 215x reduction in one-shot perplexity and up to a 1.70x improvement in mean accuracy at matched KV-cache budgets.
- A brief post-SVD healing fine-tune fully recovers the original model's accuracy.
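The summary names covariance-aware factorization and rank allocation without spelling them out. Below is a minimal Python sketch of one plausible reading, assuming the factorization whitens each projection weight by a Cholesky factor of its input-activation covariance before a truncated SVD, and that ranks are allocated in proportion to singular-value energy. The function names and the energy-based allocation rule are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a covariance-aware low-rank factorization, NOT the
# paper's code. Idea: rather than plain SVD on a projection weight W,
# whiten W by a Cholesky factor of the input-activation covariance so
# the truncation error is measured on activations W @ x, not on W itself.
import numpy as np

def whiten_and_factor(W: np.ndarray, cov: np.ndarray, rank: int):
    """Factor W (d_out x d_in) as A @ B, minimizing E||W x - A B x||^2
    for x with covariance `cov`, instead of plain ||W - A B||_F."""
    # Cholesky factor L with cov = L @ L.T (jitter for numerical safety).
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(cov.shape[0]))
    # Truncated SVD of the whitened weight: optimal in activation space.
    U, S, Vt = np.linalg.svd(W @ L, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (d_out, rank)
    B = Vt[:rank] @ np.linalg.inv(L)  # (rank, d_in), un-whitened
    return A, B

def allocate_ranks(sv_per_matrix, total_rank: int):
    """Toy stand-in for adjusted-rank allocation: give each matrix rank
    in proportion to its singular-value energy (an assumption)."""
    energy = np.array([float(np.sum(s ** 2)) for s in sv_per_matrix])
    raw = energy / energy.sum() * total_rank
    return np.maximum(1, np.floor(raw).astype(int))

# Usage with synthetic calibration activations standing in for real ones.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
X = rng.standard_normal((2048, 128))   # proxy calibration inputs
cov = X.T @ X / len(X)                 # empirical activation covariance
A, B = whiten_and_factor(W, cov, rank=16)
```

Whitening makes the SVD truncation optimal in activation space (minimizing E||Wx - ABx||^2 for x with covariance LL^T) rather than in raw weight space, which is one way to read "aligning approximations with activations."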