Decomposing the Depth Profile of Fine-Tuning

arXiv cs.LG / 4/21/2026


Key Points

  • The paper examines how fine-tuning changes a model’s internal “depth profile” across 240 runs involving 15 models from four architecture families and parameter scales from 125M to 6.9B.
  • Across nearly all standard fine-tuning runs, representational change concentrates in output-proximal layers, suggesting a common locality pattern, though one notable exception is observed.
  • The authors introduce a per-layer control that equalizes relative weight updates (||ΔW||/||W||) after each optimizer step, showing that the depth-profile behavior can persist under some settings but collapse under others.
  • Results indicate architectural differences: sequential-block architectures retain certain profile slopes across more objectives, while parallel-block architectures do so only for causal-language-modeling objectives, with the distinction narrowing at ~1.3B–1.4B.
  • Under standard training, the depth-profile shape is characterized by two additional axes: steepness correlates with a training-free objective distance measured at initialization, while profile width is largely determined by architecture.
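The per-layer control in the key points above can be made concrete: after each optimizer step, rescale every layer's weight update so that the relative update norm ‖ΔW‖/‖W‖ is the same for all layers. The paper does not publish its implementation here, so the helper below is a minimal NumPy sketch of the idea (the function name, the list-of-arrays interface, and the choice of the mean ratio as the common target are all illustrative assumptions):

```python
import numpy as np

def equalize_relative_updates(w_before, w_after, target=None):
    """Rescale each layer's update so ||dW|| / ||W|| matches across layers.

    w_before, w_after: lists of per-layer weight arrays captured before and
    after one optimizer step. `target` is the common relative-update norm to
    enforce; if None, the mean of the observed per-layer ratios is used.
    Returns the adjusted post-step weights. Illustrative sketch, not the
    authors' code.
    """
    ratios = [np.linalg.norm(a - b) / np.linalg.norm(b)
              for b, a in zip(w_before, w_after)]
    t = float(np.mean(ratios)) if target is None else target
    adjusted = []
    for b, a, r in zip(w_before, w_after, ratios):
        delta = a - b
        scale = t / r if r > 0 else 0.0  # rescale update to the target ratio
        adjusted.append(b + scale * delta)
    return adjusted
```

Because rescaling the update by `t / r` scales its norm by the same factor, every layer ends the step with relative update norm exactly `t`, removing gradient-magnitude differences across depth as a confound.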

Abstract

Fine-tuning adapts pretrained networks to new objectives. Whether the resulting depth profile of representational change reflects an intrinsic property of the model or the magnitude of gradient flow has not been tested directly. We measure this profile across 240 fine-tuning runs spanning 15 models in four architecture families (encoder and decoder transformers, a state-space model, and an RNN) at scales from 125M to 6.9B parameters. Representational change concentrates in output-proximal layers in every standard-training run except one. We apply a per-layer control that equalizes ‖ΔW‖/‖W‖ across layers after each optimizer step. Under this control, the profile persists in some conditions and collapses in others. At 125M–350M, sequential-block architectures (BERT, OPT, GPT-2) retain the slope across tested objectives while parallel-block architectures (Pythia, CodeGen) retain it only for causal-language-modeling objectives. This architectural distinction narrows at 1.3B–1.4B, where both block types show positive equal-step slopes for CausalLM. Under standard training, profile shape is described by two additional axes: steepness tracks a training-free objective distance at initialization, and profile width is dominated by architecture. We treat the locality gradient, the depthwise slope of representational change, as a composite phenomenon whose components are scale-dependent.
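The abstract repeatedly refers to a per-layer profile of representational change without naming the metric. One standard way to quantify such change is linear CKA between a layer's activations before and after fine-tuning, with 1 − CKA as the change score; the sketch below uses that choice purely for illustration (the metric, the function names, and the activation-matrix interface are assumptions, not the paper's stated method):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices.

    X, Y: arrays of shape (n_samples, n_features), e.g. one layer's
    activations on the same inputs before and after fine-tuning.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))

def depth_profile(acts_pre, acts_post):
    """Per-layer representational change, 1 - CKA, from input to output.

    acts_pre, acts_post: lists of per-layer activation matrices from the
    pretrained and fine-tuned model on the same inputs. A profile that
    rises toward the last entries is the output-proximal concentration
    the paper describes.
    """
    return [1.0 - linear_cka(a, b) for a, b in zip(acts_pre, acts_post)]
```

The depthwise slope of this profile, fit by least squares over layer index, would then correspond to the "locality gradient" the abstract names.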