When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

arXiv cs.LG / 5/6/2026


Key Points

  • The paper finds that guard models fine-tuned on fully benign data can completely lose safety alignment, not from adversarial attacks but from ordinary domain specialization.
  • Across three safety classifiers used as protection layers in agentic AI pipelines (LlamaGuard, WildGuard, and Granite Guardian), the failure is traced to the collapse of "latent safety geometry": the representational boundary that separates harmful from benign inputs.
  • In the worst case (Granite Guardian), refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous, with the authors attributing this to brittle, overly concentrated safety representations.
  • The authors propose Fisher-Weighted Safety Subspace Regularization (FW-SSR), which adds a training-time penalty based on Fisher-information-weighted, curvature-aware safety subspaces and an adaptive scaling factor to resolve task–safety gradient conflicts.
  • Geometry-based monitoring is emphasized: structural representation metrics (CKA, Fisher score) predict safety behavior more reliably than raw displacement measures, making them necessary for evaluating guard models in agentic deployments.
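
The FW-SSR penalty from the fourth point can be sketched in a few lines. Everything below is an assumption reconstructed from the summary's description, not the paper's exact formulation: a diagonal Fisher estimate weights the per-direction drift of the safety subspace from its pre-fine-tuning anchor, and the adaptive λ_t grows when the task gradient points against the safety gradient.

```python
import numpy as np

def fw_ssr_penalty(U_t, U_0, fisher_diag):
    """Fisher-weighted drift of the current safety subspace U_t (columns =
    safety directions) from its pre-fine-tuning anchor U_0. Directions with
    high Fisher information (high curvature) are penalized more for moving."""
    drift = U_t - U_0                              # (d, k) displacement
    return float(np.sum(fisher_diag[:, None] * drift ** 2))

def adaptive_lambda(g_task, g_safety, base=1.0):
    """Scale the penalty weight with task-safety gradient conflict:
    stronger regularization when the gradients point in opposing
    directions (negative cosine similarity), baseline when they agree."""
    cos = g_task @ g_safety / (
        np.linalg.norm(g_task) * np.linalg.norm(g_safety) + 1e-8
    )
    return base * (1.0 + max(0.0, -cos))
```

On this sketch, the total training loss would be `task_loss + adaptive_lambda(...) * fw_ssr_penalty(...)`, so the anchor is enforced hardest exactly when benign specialization is pulling the safety directions apart.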

Abstract

A guard model fine-tuned on entirely benign data can lose all safety alignment -- not through adversarial manipulation, but through standard domain specialization. We demonstrate this failure across three purpose-built safety classifiers -- LlamaGuard, WildGuard, and Granite Guardian -- deployed as protection layers in agentic AI pipelines, and show that it originates in the destruction of latent safety geometry: the structured harmful–benign representational boundary that guides classification. We extract per-layer safety subspaces via SVD on class-conditional activation differences and track how this boundary evolves under benign fine-tuning. Granite Guardian undergoes complete collapse -- refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous -- a severity exceeding prior findings on general-purpose LLMs, explained by the specialization hypothesis: concentrated safety representations are efficient but catastrophically brittle. To mitigate this, we propose Fisher-Weighted Safety Subspace Regularization (FW-SSR), a training-time penalty combining (i) curvature-aware direction weights derived from diagonal Fisher information and (ii) an adaptive λ_t that scales with task-safety gradient conflict. FW-SSR recovers 75% refusal on Granite Guardian (CKA = 0.983) and reduces WildGuard's Attack Success Rate to 3.6% -- below the unmodified baseline -- by actively sharpening the safety subspace rather than merely anchoring it. Across all three models, structural representational geometry (CKA, Fisher score) predicts safety behavior more reliably than absolute displacement metrics, establishing geometry-based monitoring as a necessary component of guard model evaluation in agentic deployments.
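
The abstract's diagnostic pipeline -- per-layer safety subspaces from SVD on class-conditional activation differences, tracked with CKA -- can be sketched as below. This is a minimal reading of the abstract, not the paper's implementation: it assumes paired harmful/benign activations at a given layer and the standard linear variant of CKA; the paper may use a different pairing scheme or CKA estimator.

```python
import numpy as np

def safety_subspace(acts_harmful, acts_benign, k=4):
    """Top-k safety directions at one layer from the SVD of
    class-conditional activation differences (rows = paired examples,
    columns = hidden dimensions)."""
    diffs = acts_harmful - acts_benign             # (n, d) difference matrix
    # right singular vectors span the harmful-vs-benign boundary directions
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:k].T                                # (d, k) orthonormal basis

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n, d):
    1.0 means the representational geometry is preserved under
    fine-tuning, 0.0 means it has been destroyed."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (
        np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    )
```

Comparing `linear_cka(acts_before, acts_after)` across checkpoints is the kind of structural monitor the abstract argues for: it reacts to boundary collapse even when raw activation displacement looks small.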