Decomposing the Depth Profile of Fine-Tuning

arXiv cs.LG / 4/21/2026


Key Points

  • The paper examines how fine-tuning changes a model’s internal “depth profile” across 240 runs involving 15 models from four architecture families and parameter scales from 125M to 6.9B.
  • Across nearly all standard fine-tuning runs, representational change concentrates in output-proximal layers, suggesting a common locality pattern, though one notable exception is observed.
  • The authors introduce a per-layer control that equalizes relative weight updates (||ΔW||/||W||) after each optimizer step, showing that the depth-profile behavior can persist under some settings but collapse under others.
  • Results indicate architectural differences: sequential-block architectures retain certain profile slopes across more objectives, while parallel-block architectures do so only for causal-language-modeling objectives, with the distinction narrowing at ~1.3B–1.4B.
  • Under standard training, the depth-profile shape is characterized by two additional axes: steepness correlates with a training-free objective distance measured at initialization, while profile width is largely determined by architecture.
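The per-layer control in the key points above can be made concrete: after each optimizer step, rescale every layer's weight update so that the relative update norm ‖ΔW‖/‖W‖ is the same for all layers. The paper does not publish its implementation here, so the helper below is a minimal NumPy sketch of the idea (the function name, the list-of-arrays interface, and the choice of the mean ratio as the common target are all illustrative assumptions):

```python
import numpy as np

def equalize_relative_updates(w_before, w_after, target=None):
    """Rescale each layer's update so ||dW|| / ||W|| matches across layers.

    w_before, w_after: lists of per-layer weight arrays captured before and
    after one optimizer step. `target` is the common relative-update norm to
    enforce; if None, the mean of the observed per-layer ratios is used.
    Returns the adjusted post-step weights. Illustrative sketch, not the
    authors' code.
    """
    ratios = [np.linalg.norm(a - b) / np.linalg.norm(b)
              for b, a in zip(w_before, w_after)]
    t = float(np.mean(ratios)) if target is None else target
    adjusted = []
    for b, a, r in zip(w_before, w_after, ratios):
        delta = a - b
        scale = t / r if r > 0 else 0.0  # rescale update to the target ratio
        adjusted.append(b + scale * delta)
    return adjusted
```

Because rescaling the update by `t / r` scales its norm by the same factor, every layer ends the step with relative update norm exactly `t`, removing gradient-magnitude differences across depth as a confound.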

Abstract

Fine-tuning adapts pretrained networks to new objectives. Whether the resulting depth profile of representational change reflects an intrinsic property of the model or the magnitude of gradient flow has not been tested directly. We measure this profile across 240 fine-tuning runs spanning 15 models in four architecture families (encoder and decoder transformers, a state-space model, and an RNN) at scales from 125M to 6.9B parameters. Representational change concentrates in output-proximal layers in every standard-training run except one. We apply a per-layer control that equalizes ‖ΔW‖/‖W‖ across layers after each optimizer step. Under this control, the profile persists in some conditions and collapses in others. At 125M–350M, sequential-block architectures (BERT, OPT, GPT-2) retain the slope across tested objectives while parallel-block architectures (Pythia, CodeGen) retain it only for causal-language-modeling objectives. This architectural distinction narrows at 1.3B–1.4B, where both block types show positive equal-step slopes for CausalLM. Under standard training, profile shape is described by two additional axes: steepness tracks a training-free objective distance at initialization, and profile width is dominated by architecture. We treat the locality gradient, the depthwise slope of representational change, as a composite phenomenon whose components are scale-dependent.
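The abstract repeatedly refers to a per-layer profile of representational change without naming the metric. One standard way to quantify such change is linear CKA between a layer's activations before and after fine-tuning, with 1 − CKA as the change score; the sketch below uses that choice purely for illustration (the metric, the function names, and the activation-matrix interface are assumptions, not the paper's stated method):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices.

    X, Y: arrays of shape (n_samples, n_features), e.g. one layer's
    activations on the same inputs before and after fine-tuning.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))

def depth_profile(acts_pre, acts_post):
    """Per-layer representational change, 1 - CKA, from input to output.

    acts_pre, acts_post: lists of per-layer activation matrices from the
    pretrained and fine-tuned model on the same inputs. A profile that
    rises toward the last entries is the output-proximal concentration
    the paper describes.
    """
    return [1.0 - linear_cka(a, b) for a, b in zip(acts_pre, acts_post)]
```

The depthwise slope of this profile, fit by least squares over layer index, would then correspond to the "locality gradient" the abstract names.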