Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance

arXiv cs.LG / 4/13/2026

Key Points

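  • Summarizing LoRA adapter weight deltas with per-layer spectral features (stable rank, singular-value entropy, effective rank, singular-vector alignment to a healthy centroid, and similar) was shown to identify the fine-tuning objective (for example, DPO on inverted preferences) with high accuracy.
  • In a pre-registered experiment on Llama-3.2-3B-Instruct, within a single training method (DPO), binary and multi-class objective identification and ordinal severity ranking were near-perfect (AUC 1.00, ρ ≥ 0.956), and the first principal component (PC1) carried the objective information independently of training duration.
  • Generalization across methods failed, however: a classifier trained on DPO adapters could not correctly flag steering-derived adapters from another method as drift (AUC 0.00).
  • In the behavioral evaluation, DPO-inverted-harmlessness adapters showed elevated harmful compliance on HEx-PHI prompts (ASR 0.266 vs. 0.112 for healthy adapters), and a dose-response relationship with training intensity was observed with high correlation (ρ = 0.986).
  • The rank correlation between spectral geometry and harmful compliance also holds to a moderate degree (ρ = 0.72), but the authors conclude that cross-method monitoring requires per-method calibration.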
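
The spectral features named in the first key point can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the textbook definitions (stable rank as squared Frobenius norm over squared spectral norm, effective rank as the exponential of the singular-value entropy), the standard LoRA update `(alpha / r) * B @ A`, and an alignment measure based on the top right singular vector. The function names `lora_delta`, `spectral_features`, and `top_vector_alignment` are hypothetical, and the paper's exact feature definitions may differ.

```python
import numpy as np

def lora_delta(B: np.ndarray, A: np.ndarray, alpha: float, r: int) -> np.ndarray:
    """Standard LoRA weight delta: (alpha / r) * B @ A (assumed scaling)."""
    return (alpha / r) * (B @ A)

def spectral_features(delta: np.ndarray, eps: float = 1e-12) -> dict:
    """Spectral summaries of one layer's weight delta (textbook definitions)."""
    s = np.linalg.svd(delta, compute_uv=False)  # singular values, descending
    fro_sq = float(np.sum(s ** 2))
    p = s / (s.sum() + eps)                     # normalized singular values
    p_nz = p[p > eps]                           # drop numerically zero entries
    entropy = float(-np.sum(p_nz * np.log(p_nz)))
    return {
        "frobenius_norm": float(np.sqrt(fro_sq)),
        "spectral_norm": float(s[0]),
        "stable_rank": fro_sq / (float(s[0]) ** 2 + eps),
        "sv_entropy": entropy,
        "effective_rank": float(np.exp(entropy)),
    }

def top_vector_alignment(delta: np.ndarray, centroid: np.ndarray) -> float:
    """Absolute cosine between the top right singular vectors of two deltas.

    Assumption: 'alignment to a healthy centroid' is approximated here by
    comparing against the mean delta of healthy adapters for the same layer.
    """
    v_a = np.linalg.svd(delta, full_matrices=False)[2][0]
    v_b = np.linalg.svd(centroid, full_matrices=False)[2][0]
    return float(abs(np.dot(v_a, v_b)))

# Usage with random stand-in weights (not real adapter weights).
rng = np.random.default_rng(0)
B = rng.normal(size=(64, 8))   # hypothetical LoRA "B" factor, rank 8
A = rng.normal(size=(8, 32))   # hypothetical LoRA "A" factor
delta = lora_delta(B, A, alpha=16.0, r=8)
features = spectral_features(delta)
```

Concatenating these per-layer dictionaries across layers and projection types would give one feature vector per adapter, which is the kind of input a logistic regression classifier could consume.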

Abstract

We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on Llama-3.2-3B-Instruct, we manufacture 38 LoRA adapters across four categories (healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters) and extract per-layer spectral features: norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid. Within a single training method (DPO), a logistic regression classifier achieves AUC 1.00 on binary drift detection and on all six pairwise objective comparisons, along with near-perfect ordinal severity ranking (ρ ≥ 0.956). Principal component analysis on flattened weight deltas reveals that training objective is captured by PC1 (AUC 1.00 for objective separation), while training duration falls on the orthogonal PC2. Query-projection weights detect that drift occurred; value-projection weights identify which objective. Cross-method generalization fails completely: a DPO-trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC 0.00). In a behavioral evaluation phase, DPO-inverted-harmlessness adapters show elevated harmful compliance on HEx-PHI prompts (mean ASR 0.266 vs. healthy 0.112, Δ = +0.154), with a near-perfect dose-response relationship (ρ = 0.986). The geometry-to-behavior rank correlation is ρ = 0.72 across 24 non-steered adapters. These results establish that within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross-method monitoring requires per-method calibration.
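
To make the cross-method result concrete, the sketch below computes AUC as a pairwise ranking statistic: the fraction of (positive, negative) pairs in which the positive example receives the higher score. It assumes one reading of the abstract, namely that steering adapters are the class being compared against DPO adapters. The drift scores are invented for illustration and are not results from the paper, and `pairwise_auc` is a hypothetical helper name.

```python
import numpy as np

def pairwise_auc(pos_scores, neg_scores) -> float:
    """AUC as P(score_pos > score_neg), counting ties as half a win."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum()
    ties = (pos == neg).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))

# Hypothetical drift scores from a classifier trained on DPO adapters only.
dpo_scores = [0.91, 0.88, 0.97]
steering_scores = [0.12, 0.30, 0.05]

# Every steering adapter scores below every DPO adapter, so no pair is
# ranked in favor of the steering adapters.
auc_steering_vs_dpo = pairwise_auc(steering_scores, dpo_scores)
```

Under this reading, an AUC of 0.00 does not mean the scores are uninformative (that would be 0.50). It means the ranking is perfectly inverted relative to the intended one, which is consistent with the abstract's point that a threshold calibrated on one method does not transfer to another.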