Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP

arXiv cs.LG / 4/21/2026


Key Points

  • The study (arXiv:2604.16410) compares Full Fine-Tuning (Full FT) and LoRA on CLIP using a controlled matched-learning-rate grid to avoid confounded learning-rate conventions.
  • Results show learning rate strongly affects attention drift and representation structure: on EuroSAT, Full FT transitions from mild entropy broadening at 1e-6 to marked entropy contraction at 5e-5, while LoRA stays entropy-positive across the matched range.
  • At matched learning rates, LoRA preserves far more out-of-domain transfer than Full FT, averaging 45.13% versus 11.28% CIFAR-100 zero-shot accuracy on EuroSAT and 58.01% versus 8.54% on Pets.
  • The paper finds a “regime” effect on Oxford-IIIT Pets: low-learning-rate LoRA can underfit in-domain, meaning method-only averages may hide the conditions where LoRA becomes competitive.
  • The authors argue that matched-learning-rate evaluation materially changes how to interpret Full FT vs LoRA, and that attention-drift metrics are most valuable as descriptive diagnostics of representation preservation rather than causal drivers of transfer.
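The attention-drift results above turn on an entropy measure over attention maps. As a minimal sketch of how such a diagnostic could be computed, the snippet below defines drift as the change in mean row-wise Shannon entropy of attention maps before versus after fine-tuning; this specific formulation is our assumption, not necessarily the exact metric of arXiv:2604.16410.

```python
# Sketch of an attention-entropy drift diagnostic. Assumption: the metric is
# the change in mean Shannon entropy over attention rows after adaptation
# (positive = broadening, negative = contraction). The paper's exact
# definition may differ.
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Mean Shannon entropy (nats) over attention rows.

    attn: array of shape (..., num_queries, num_keys); each row sums to 1.
    """
    p = np.clip(attn, 1e-12, None)  # avoid log(0)
    row_entropy = -(p * np.log(p)).sum(axis=-1)
    return float(row_entropy.mean())

def entropy_drift(attn_pre: np.ndarray, attn_post: np.ndarray) -> float:
    """Entropy change from pretrained to fine-tuned attention maps."""
    return attention_entropy(attn_post) - attention_entropy(attn_pre)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 8  # toy token count
    uniform = np.full((4, n, n), 1.0 / n)  # maximally broad attention
    logits = rng.normal(size=(4, n, n)) * 5.0
    peaked = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    # Sharpening attention yields a negative drift (entropy contraction).
    print(f"drift (uniform -> peaked): {entropy_drift(uniform, peaked):+.3f}")
```

Under this reading, the EuroSAT result says Full FT's drift crosses from slightly positive at 1e-6 to strongly negative at 5e-5, while LoRA's stays positive across the grid.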

Abstract

CLIP adaptation can improve in-domain accuracy while degrading out-of-domain transfer, but comparisons between Full Fine-Tuning (Full FT) and LoRA are often confounded by different learning-rate conventions. We study how adaptation method and optimization scale jointly shape attention drift and transfer retention in CLIP using a controlled matched-learning-rate comparison of Full FT and LoRA. The completed matrix contains 80 runs on CLIP ViT-B/32 across EuroSAT and Oxford-IIIT Pets, spanning four shared learning rates (1e-6, 5e-6, 1e-5, 5e-5) and five seeds, and evaluates attention-drift metrics, best validation accuracy, and adapter-aware CIFAR-100 zero-shot accuracy. Learning rate strongly modulates structural change: on EuroSAT, Full FT moves from mild entropy broadening at 1e-6 to marked contraction at 5e-5, whereas LoRA remains entropy-positive across the full matched grid. At matched learning rates, LoRA preserves substantially more zero-shot transfer than Full FT, averaging 45.13% versus 11.28% CIFAR-100 accuracy on EuroSAT and 58.01% versus 8.54% on Pets. Oxford-IIIT Pets also reveals a regime effect: low-learning-rate LoRA underfits in-domain, so method-only averages can obscure when LoRA becomes competitive. Supporting rollout, patch-to-patch, and CKA analyses are directionally consistent with the controlled matrix. Overall, matched-learning-rate evaluation materially changes the interpretation of Full FT versus LoRA, and attention drift is most useful as a descriptive diagnostic of representation preservation rather than a causal explanation of transfer behavior.
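The abstract's supporting analyses include CKA, a standard similarity measure between representation matrices. As a sketch, here is the common linear (feature-space) form of CKA; the abstract does not say whether the paper uses the linear or kernel variant, so treat this as one plausible instantiation.

```python
# Linear CKA between two representation matrices, e.g. pretrained vs
# fine-tuned CLIP features for the same inputs. Assumption: the paper's
# CKA analysis uses this linear form; a kernel variant is also possible.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between (n_samples, d1) and (n_samples, d2) features.

    Returns a value in [0, 1]; 1 means identical up to orthogonal
    transform and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    feats_pre = rng.normal(size=(50, 16))   # toy pretrained features
    feats_post = feats_pre + 0.1 * rng.normal(size=(50, 16))  # mild drift
    print(f"CKA after mild drift: {linear_cka(feats_pre, feats_post):.3f}")
```

In the paper's framing, higher CKA between pretrained and adapted representations would be expected to track the descriptive "representation preservation" story, without implying causation for transfer retention.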